IQ Test Results for AI
237 comments
August 17, 2025
technothrasher
> The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others
My son took an IQ test and it wouldn't score him because he breaks this assumption. He was getting 98% in some tasks and 2% in others. The psychologist giving him the test said the pattern was unlikely enough that they couldn't get an IQ result for him. He's been diagnosed with non-verbal learning disability, and this is apparently common for NVLD folks.
Retric
IMO g is purely an abstraction. As long as the rate at which you learn most things is within a reasonable bound, spending more or less time learning/perfecting X impacts the time you can spend on Y, resulting in people being generally more or less proficient across a huge range of common cognitive skills. Thus, testing those general skills is normally a proxy for a wide range of things.
LD breaks IQ because it results in noticeably uneven skill acquisition, even in foundational skills. Meanwhile, increasing levels of specialization reward being abnormally good at a very narrow set of skills, making IQ less significant. The #1 rock climber in the world gets sponsors; the 100th gets a hobby.
moritonal
Just to add an anecdote, as of 15 years ago I had similar scores and was diagnosed with dyslexia.
whatshisface
A rule whose exceptions are one in a hundred fails for three million Americans.
alistairSH
And works for 290 million plus…
erikerikson
A term of use for your son is twice exceptional. The GP is correct about the theoretical basis of the tests. Note the use of "tend" in the quote. Even those who fit the pattern better tend to have differential strengths, so the single-factor view has shown itself to be too simple. Over time the models of intelligence have grown more complex, adding EQ (emotional quotient), SQ (social quotient), and so on, but IQ was first and continues to be considered useful in some ways, even as it is also considered oppressive by some.
Workaccount2
EQ, SQ, and whatever other Qs are not really a true thing. They're more feel-good tests for Facebook dwellers who get confused by IQ tests.
There are social assessments, but they are for identifying disorders.
PaulHoule
I've maxed out every test or subscale of verbal intelligence that I've taken since I was 12 or so, but my schizotaxic brain glitches enough that my problem-solving ability in test environments is a bit degraded: still at least an SD above average, but enough that I get an IQ test score that is merely high and not off the charts.
alphazard
IQ is a discovery about how intelligence occurs in humans. As you mentioned, a single factor explains most of a human's performance on an IQ test, and that model is better than theories of multiple orthogonal intelligences. By contrast, five orthogonal factors are the best model we have for human personality.
The first question to ask is "do LLMs also have a general factor?". How much of an LLM's performance on an IQ test can be explained by a single positive correlation between all questions? I would expect LLMs to perform much better on memory tasks than anything else, and I wouldn't be surprised if that was holding up their scores. Is there a multi-factor model that better explains LLM performance on these tests?
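One way to check, given per-question results for many models (everything below is made-up illustration data, not real scores): build a models-by-questions matrix, then see whether the inter-question correlations are uniformly positive and how much variance a single factor explains. A rough Python sketch:

    import numpy as np

    # Hypothetical data: rows = LLMs, columns = IQ-test questions, 1 = correct.
    rng = np.random.default_rng(0)
    results = (rng.random((40, 35)) > 0.4).astype(float)

    corr = np.corrcoef(results, rowvar=False)            # question-by-question correlations
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    print(f"positive correlations: {(off_diag > 0).mean():.0%}")   # the "positive manifold"

    eigvals = np.linalg.eigvalsh(corr)                   # ascending eigenvalues
    print(f"first factor explains {eigvals[-1] / eigvals.sum():.0%} of variance")

If nearly all correlations are positive and the first factor dominates, LLMs have something like a g; if memory-heavy questions form their own cluster, a multi-factor model fits better.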
og_kalu
>The first question to ask is "do LLMs also have a general factor?".
Yes, there is some research about it here - https://www.sciencedirect.com/science/article/pii/S016028962...
naveen99
Some points in the 4(?)- or 5-dimensional personality space do correlate with higher IQ, though.
alphazard
That may be the case. The personality traits are mostly uncorrelated with one another.
I was trying to give an example of what a successful multi-factor model looks like (the Big 5) to then contrast it with a multi-factor model that doesn't work well (theories of multiple intelligences).
YetAnotherNick
The ARC-AGI challenge aims for that. In fact the objective is even stricter: the tasks must be trivial for most humans, given time.
nsoonhui
Another component of this theory concerning g is that it's largely genetic and immune to "intervention", i.e. the stability you mentioned. See the classic "The Bell Curve" for a full exposition.
Which makes me wonder: what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if the g factor is nature and largely immutable? What's the logic of the educators here?
matthewdgreen
If IQ potential were 50% genetic, then teaching could potentially raise your actual IQ by affecting the other 50%, which is huge. IQ scores in populations and individuals change based on education, nutrition, etc. But even if we hypothesized a pretend world where "g" was magically 100% genetic, this (imaginary) measure is just potential. It is not true that an uneducated, untrained person will be able to perform tasks at the level of an educated, trained person. Also, The Bell Curve was written by a political operative to promote ideological views, and is full of foundational errors.
clove
You've made the mistake of thinking that if IQ were 50% genetic (it isn't - it's way more than that, but that's beside the point), then the remaining proportion is completely (non-shared) environmental.
Researchers in this field actually break down the non-genetic component into three major components: 1. Shared environment 2. Non-shared environment 3. Error
Shared environment accounts for so little variance that it might as well be ignored, while non-shared accounts for little more than error.
Note that by including error in the non-genetic component, just as you've done in your post above, you are effectively downplaying the otherwise undeniably predictive link from genes to IQ. In other words, whatever number you give is automatically deflated by the way a psychometric is measured.
This has never been the source of debate. Back when I was going through grad school in intelligence, people didn't have to overthink how they presented the data. Intelligence was already a mature field, and we discussed the data openly. But in the past couple of decades or so, a lot of people such as yourself have popped up, attempting to craft irrelevant, statistically incorrect arguments against the results of certain well-established psychometrics that happen not to fit within whatever mental world your brand of politics prescribes.
If you really cared about the data, you'd be discussing the numbers. But your interest in this previously niche topic isn't in understanding reality; it's in justifying your worldview, which is why you deny the established data, immediately present a caveat stating that the data doesn't matter in the first place, appeal to emotions, and finish it all off by claiming those who disagree with you have been brainwashed. None of those four arguments have any merit in a genuine discussion on this topic.
ok_computer
Lower and median IQ people still benefit from literacy, numeracy, and art to function in society. The point of education systems isn’t to boost individuals’ dimensionally reduced 1D metrics but rather enrich their lives and contributions to society. There will always be distributions of abilities and means but that doesn’t justify neglecting the bulk of tax paying people.
jacquesm
It used to be that society worked just fine with people of all grades of smarts. But we're rapidly getting to the point that to be able to earn a living wage you need to be above average, especially if you are the sole income provider for a while. AI is further steepening that S curve's mid-section.
jdietrich
G is (largely) immutable, but knowledge and skills are not. The economy is not zero-sum and we all benefit from increasing the total amount of human capital. Unfortunately, thinking around education is dominated by people who wrongly believe that the economy is zero-sum.
api
Throughout 99% of human history, the economy was mostly zero sum. If someone was rich it was because they stole it. If a group was wealthier it was because they stole it. By "stole it" I'm including theft of labor through slavery as well as the usual conquests and raiding. There was little to no innovation over time spans as long as thousands of years.
Most of human culture and philosophy evolved during these periods and bakes in the idea that the pie is finite and that anyone with more of it has stolen it, because that was just an accurate picture of reality.
A growing pie is a rare condition. It has happened a few times during periods of high civilization: Egypt, Greece, and Rome in the West, and similar examples exist in lots of other places.
A rapidly growing pie is entirely new. The modern world is an extreme historical aberration built on the scientific method, modern engineering methods, and the discovery of massive amounts of exploitable cheap energy in the form of first fossil fuels, then nuclear power, then (today) learning to exploit things like solar and wind energy at exponentially larger scales. Other innovations that have fed into this unique condition include synthetic fertilizers, antibiotics, vaccines, etc.
Humans have never lived in an environment like this. Everything in our evolution and our accumulated culture is screaming that it's wrong -- that it will either collapse tomorrow (hence the perennial popularity of doomerism) or that it must be built on some kind of insanely massive crime because otherwise where is all this wealth coming from? That's because it's impossible. It cannot be. The idea that wealth can be created at this scale is just... not a thing that has ever existed until maybe 200 years ago max but really more like 80-100. Before that there was only subsistence and theft.
Edit: I'm not arguing that there is no slavery or near-slavery or theft/conquest in the modern world. These things certainly still happen. I'm arguing that it is not the primary source of our massive wealth. Slavery and conquest have always been around and no society has ever been this wealthy or grown this fast. Not even close.
cedilla
"The Bell Curve" is, let's say, highly controversial and not a good introduction into the topic. Its claim that genetics are the main predictor of IQ, which was very weakly supported at the time, has been completely and undeniably refuted by science in the thirty years since it's publication.
alphazard
This is misleading. Anyone who wants to learn about IQ should Google it. It's the most replicated finding in psychology, and any questions you have about twins or groups with similar or different genes have probably been investigated. There is a lot of noise online in the form of commentary about IQ, so it's important to look at actual data if you are skeptical/curious.
holbrad
> has been completely and undeniably refuted by science in the thirty years since it's publication.
This is literally the exact opposite.
nialse
Do note that The Bell Curve is not considered controversial in general. The part about race and genetics is. Also genes being the sole predictor of IQ is not an accurate description of the book’s premise.
fortran77
I've read the book. It says that aspects of IQ can be heritable, but doesn't ever say "genetics are the main predictor of IQ".
Quoting directly from the book: "It seems highly likely to us that both genes and the environment have something to do with IQ differences", and the book states that "the exact contribution of genes versus environment is unknown."
hemabe
And yet, in the US, the first start-ups are offering the possibility of testing embryos for their IQ.
https://www.theguardian.com/science/2024/oct/18/us-startup-c...
brabel
Really? If not genetics then what is it? Just random??
BoingBoomTschak
How much of that controversy is manufactured, though? https://en.wikipedia.org/wiki/The_IQ_Controversy,_the_Media_... (if even Wikipedia can't drag this through the mud...)
A4ET8a8uTh0_v2
It is not controversial at all. It was deemed inappropriate due to the amount of 'wrongthink' it causes. We can argue about what followed and whether its claims have been nullified, but given how many conversations started with it, I sincerely doubt that the argument against it as an introduction is reasonable. In a sense, it is the source of the debate.
lukan
Assuming it is true (which I doubt), there is obviously still value in teaching: imparting knowledge and practical skills so students know more, rather than producing more intelligent students.
You can have an IQ of over 200, but if no one ever showed you how a computer works or gives you a manual, you still won't be productive with it.
But I very much believe intelligence is improvable and also degradable; just ask some alcoholics, for instance.
pama
An IQ of 200 (or higher) does not exist according to the original definitions of this metric. You need a population of 219 billion or more to have a 95% chance that a sample exists 6.66 standard deviations above the mean (assuming a mean of 100 and an SD of 15). Of course the tests are of limited value and things can be gamed, but it would be silly to try to identify samples that have no chance of existing.
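A back-of-the-envelope check in Python (assuming the usual normal model, mean 100 and SD 15) lands on the same order of magnitude:

    import math

    def tail_prob(iq, mean=100.0, sd=15.0):
        # P(score >= iq) under a normal distribution, via the complementary error function
        z = (iq - mean) / sd
        return 0.5 * math.erfc(z / math.sqrt(2))

    p = tail_prob(200)                      # ~1.3e-11
    # Smallest population with a 95% chance of containing at least one such person:
    # 1 - (1 - p)**n >= 0.95  =>  n >= ln(0.05) / ln(1 - p)
    n = math.log(0.05) / math.log1p(-p)
    print(f"p = {p:.2e}, n = {n / 1e9:.0f} billion")   # roughly 220 billion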
twodave
One's ceiling may be more or less stable, but there are many instances where individuals have certain underdeveloped cognitive skills (of which there are a litany), undergo training to develop those skills, and afterwards go on to score (sometimes much) higher on IQ tests. Children with certain disabilities such as autism or FASD tend to see more dramatic differences. This isn't to say they "became more intelligent," but rather that the testing is unable to measure intelligence directly, relying instead on those certain cognitive skills as a proxy for intelligence.
conradev
Intelligence is not knowledge and it is not wisdom. You have to “learn” to get those.
It’s much more akin to VO2 max in aerobic exercise, something like 70% genetic. It is still good for everyone to exercise even if it is harder or easier for some.
csa
> Which makes me wonder what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if g factor is nature and immutable by large? What's the logic of the educators here?
Many/most people (esp. young people) are not pushed to the limits of their capacity to learn.
Quality interventions guide people closer to these limits.
krapp
I imagine the value of something like this is for business owners to choose which LLMs they can replace their employees with, so its use of human IQ tests is relevant.
azernik
The point is that the correlation between doing well on these tasks and doing well on other (directly useful) tasks is well established for humans, but not well established for LLMs.
If the employees' job is taking IQ tests, then this is a great measure for employers. Otherwise, it doesn't measure anything useful.
bbarnett
> Otherwise, it doesn't measure anything useful.
Oh it measures a useful metric, absolutely, as aspects of an IQ test validate certain types of cognition. Those types of cognition have been found to map to real-world employment of the same.
If an AI is incapable of performing admirably on an IQ test for those types of cognition, then one thing we're certainly measuring is that it's incapable of handling that 'class' of cognition when the conditions change in even minuscule ways.
And that's quite important.
For example, if the model appears to perform specific work tasks well, related to a class of cognition, but then cannot do the same category of cognitive tasks outside that scope, we're measuring a lack of adaptability or of true cognitive capability.
It's definitely measuring something. Such as, will the model go sideways with small deviations on task or input? That's a nice start.
sigmoid10
Big caveat here:
This website's method doesn't work for LLMs the way IQ tests work for humans. For humans, there is a strict time limit on these tests (at least in officially recognised settings like Mensa). This kind of sequence completion is mostly a question of how fast your brain can iterate on problems: being able to solve more questions within the time limit means a higher score, because your brain essentially switches faster. But the LLMs are given all the time in the world, in parallel, to see how many questions they can solve at all. If you look at the examples, you'll see some high-end models struggling with some of the first questions, which most humans would normally get easily; only the later ones, where you really have to think through multiple options, get hard. So a 100-IQ LLM here is not technically more intelligent on IQ test questions than 50% of humans.
If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.
abullinan
Mensa really needs to be left out of these discussions. It's not scientific; it is just a money grab for people who need intellectual validation. You can be admitted with a top-10% SAT score and no in-person testing at all. The in-person testing is in three parts: one part is a memory test, the second part is a Mensa test, the third part is the Wechsler test. Source: I joined in 1995 because I needed intellectual validation. :)
leopoldj
The point of this is not so much to compare humans with AI, but to compare AI with traditional software development approaches to this domain (IQ tests, in this case). I believe, and I could be wrong, that it will be nearly impossible, or too expensive, to develop deterministic software that beats AI on an IQ test.
nerevarthelame
I agree that it's wrong to do so, but the maintainer of this site certainly thinks that the point is to compare humans with AI. He frequently compares the results to human IQ test takers without any sort of caveats: "Now o3 scores an IQ of 116, putting it in the top 15% of humans. The median Maximum Truth reader, for comparison, scored 104." [0]
0: https://www.maximumtruth.org/p/skyrocketing-ai-intelligence-...
mdp2021
But when an LLM fails despite having all the time in the world, you are pretty certain you have hit a wall.
So, in a way, you have defined a good indicator of a limit in a certain area.
sigmoid10
There is not enough sampling here to reach this conclusion. Remember, you can crank things like o3 pretty high on tasks like ARC AGI if you're willing to spend thousands of dollars on inference time compute. But that's obviously not in the budget for an enthusiast site like this.
mdp2021
Sure but, you wrote:
> If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.
You interpreted "smarter" the IQ way: results constrained by time. But we actually get an indicator of whether the LLM is able to reach the result at all, given time; that is the interpretation of "smarter" that many of us need.
(Of course, it remains to be seen whether the ability to achieve those contextual results exports as an ability relevant to the solutions we actually need.)
mutkach
Judging from the reasoning trace for the problem of the day, almost all of the models obviously had some IQ-test material in their training data, or at least it could be said that the models are biased in a beneficial way. From the beginning of the trace you can see that the model has already "figured it out"; the reasoning is done only to apply the basic arithmetic.
None of the models actually "reasoned" about what the problem could possibly be. None of them considered that more intricate patterns are possible in a 3x3 grid (having taken these kinds of tests earlier in life, I still had a few seconds of indecision, wondering whether this was the same kind of test I'd seen before and not some more elaborate one), and none of them tried solving the problem column-wise (which is still possible, by the way). Personally, I think that indicates a strong bias present in the pretraining. For what it's worth, I would consider a model that came up with at least a few different interpretations of the pattern while "reasoning" to be the most intelligent one, irrespective of the correctness of the answer.
amunozo
Babe wake up. New benchmark to overfit models just dropped.
testdelacc1
They’re definitely going to overfit on this, but this will be much better from a marketing perspective. Normies don’t know wtf an MMLU is, but they do know what IQ is and that 140 is a big number.
Can’t wait for CEOs to start saying “why would we hire a 120 IQ person who works 9-5 with a lunch break when we can hire a 170 IQ worker who works 24x7 for half the cost??”
notahacker
"Workers rejoice as model overfitted to score 170 on IQ test turns out to be incapable of performing basic tasks..."
kcplate
There are a lot of people with high IQs that appear to be incapable of performing basic tasks too
CamperBob2
Nothing matters except the first couple of time derivatives. The workers aren't getting any better.
scotty79
They have an offline test that's supposedly not in the training data. It produces lower scores, but the best model still hits an IQ of 120.
jedberg
The more interesting link (to me) is this one: https://www.trackingai.org/political-test
They run each model through the political leaning quiz.
Spoiler alert: They all fall into the Left/Liberal box. Even Grok. Which I guess I already knew but still find interesting.
LNSY
There's a way to fix this political bias: feed it a bunch of bad code
https://www.quantamagazine.org/the-ai-was-fed-sloppy-code-it...
It's almost as if altruism and equality are logical positions or something
dwroberts
> Note: VERBAL models are asked using the verbalized test prompt. VISION models are asked the test image instead without any text prompts.
Just glancing at the bar graphs, the vision models mostly suck across the board on each question, whereas the verbal ones do OK.
And today's example of clock faces (#17) does a good job of demonstrating why: when a lot of the diagrams are explained verbally, they become significantly easier to solve.
Maybe it's just me, but in #17, for example, it's not immediately obvious those shapes are even supposed to represent clocks, and yet the verbal prompt turns each one into a clock time for the model (e.g. 1:30), which feels like 50% of the problem being solved before the model does anything at all.
jonahx
Doesn’t training data pollution largely invalidate the usefulness of this benchmark?
ahmedhawas123
Would be great to add a few human benchmarks to this (e.g., average US IQ, Ivy League average, human 80th percentile). An IQ-per-cost metric could also be fun.
Overall this is fun, but I'm not sure anyone in their right mind will be selecting an LLM based on this IQ benchmark.
jedberg
The IQ curve is designed so that 100 is average. So two of your questions can be answered with math:
> average US IQ
100
>human 80th percentile
113
The last one, Ivy League average, can be guessed based on published data. The median SAT score of an Ivy League attendee is 1500. A 1500 on the SAT is roughly an IQ of 130-140.
So in theory, the median Ivy League attendee is at the genius level.
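For anyone who wants to redo the percentile math, it's just the normal distribution the tests are normed to, e.g. with Python's standard library:

    from statistics import NormalDist

    iq = NormalDist(mu=100, sigma=15)
    print(round(iq.inv_cdf(0.50)))    # 100: the average, by construction
    print(round(iq.inv_cdf(0.80)))    # 113: the 80th percentile
    print(round(iq.cdf(135) * 100))   # ~99: percentile of a 135 IQ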
Gimpei
I'm surprised the score isn't higher. What's to stop an LLM from training on the complete corpus of IQ tests? I assume they'd get perfect scores.
pico303
I was thinking that too. I wouldn't even trust that the "offline" tests' questions and answers aren't posted online somewhere. This might really be an analysis of how extensive the dataset is for each LLM, not how much smarter one LLM is than another.
ComputerGuru
Really curious that o4-mini scores (slightly) higher than o4-mini-high.
For whatever this benchmark is worth, it's yet another metric showing Gemini 2.5 Pro is really one of the best all-around models (despite being a bit older now), and available without a subscription.
while_true_
On the Mensa IQ test, GPT-Pro got 34 out of 35 correct, for an IQ of 148. Very good. Rumor has it the one question it missed had something to do with instances of "b" in "blueberry."
jonplackett
You really need to use a CDN before you get #1 on HN.
charles_f
I'm on a shared hosting instance with relatively low resource allocation but reasonable bandwidth, and made #1 several times while never having issues loading. As long as your content is static and doesn't generate load on your server, you should be fine serving a lot of concurrent requests. Issues start when serving content relies on a database, or you serve large content
diggan
I mean, not really. As always, you just need to make sure you're not doing tens of dynamic calls for each page load, and if you do, add at least a minute-long cache. Most of the stuff that gets hugged to death really shouldn't; most of the time it's just static content that is trivial to host on even a $10/month instance.
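A minimal in-process version of that minute-long cache, as a Python sketch (the function names here are made up, and in practice you'd more likely cache at the proxy or CDN layer):

    import time
    from functools import wraps

    def ttl_cache(seconds=60):
        # Memoize a zero-argument page renderer for `seconds`.
        def decorator(fn):
            state = {"expires": 0.0, "value": None}
            @wraps(fn)
            def wrapper():
                now = time.monotonic()
                if now >= state["expires"]:
                    state["value"] = fn()            # regenerate at most once per minute
                    state["expires"] = now + seconds
                return state["value"]
            return wrapper
        return decorator

    @ttl_cache(seconds=60)
    def render_front_page():
        # Stand-in for the expensive part: DB queries, template rendering, etc.
        return "<html>...</html>"

However many requests arrive during that minute, the expensive work runs only once.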
FranOntanaya
The number of calls on some pages displaying the simplest stuff is mind-boggling. 160 requests for a page just displaying an HTML5 video and a title, 360 requests for a Reddit page. It's nuts. We don't need to be like this.
yetihehe
"We and our 350 partners care about your privacy".
stared
It was not my intention to bring on the HN hug of death.
(For reference: I shared the link; I am not the author.)
habibur
Caching is the solution. Don't serve dynamic content without HTML caching.
kator
LLM vibe coded site and architecture?
mirekrusin
How many microservices, SQL joins, distributed Kafka pipelines, etc. do we currently recommend for serving a static, public article?
scotty79
Dumping things on Cloudflare is clever architecture now?
ekianjo
not if you have a static site
The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others - the “positive manifold.”
They then hypothesized a general factor, "g," to explain this pattern. Early tests (e.g., Binet–Simon; later Stanford–Binet and Wechsler) sampled a wide range of tasks, and researchers used correlations and factor analysis to extract the common component, then normed it around 100 with an SD of 15 and called it IQ.
IQ tends to meaningfully predict performance across some domains, especially education and work, and shows high test–retest stability from late adolescence through adulthood. It also tends to be consistent across high-quality tests, despite a wide variety of testing methods.
It looks like this site just uses public IQ tests normed on humans. But it would have been more interesting if an IQ test were developed specifically for AI, i.e. a test that aims to factor out the strength of a model's general cognitive ability across a wide variety of tasks. It is probably doable by running principal component analysis on a large set of the benchmarks available today.
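A sketch of that, with random numbers standing in for a real models-by-benchmarks score matrix (none of these figures are real):

    import numpy as np

    # Hypothetical inputs: rows = models, columns = benchmark scores (MMLU, ARC, etc.).
    rng = np.random.default_rng(0)
    scores = rng.random((50, 12))            # stand-in for real benchmark results

    # Standardize each benchmark, then extract the first principal component.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    g = z @ eigvecs[:, -1]                   # projection onto the largest component

    print(f"general factor explains {eigvals[-1] / eigvals.sum():.0%} of variance")

    # Norm the factor scores the way IQ is normed: mean 100, SD 15.
    ai_iq = 100 + 15 * (g - g.mean()) / g.std()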