
Quantitative AI progress needs accurate and transparent evaluation

fsh

I believe that it may be misguided to focus on compute that much, and it would be more instructive to consider the effort that went into curating the training set. The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set. Many of the AI achievements would probably look a lot less miraculous if one could check the training data. The most crass example is OpenAI paying off the FrontierMath creators last year to get exclusive secret access to the problems before the evaluation [1]. Even without resorting to cheating, competition formats are vulnerable to this. It is extremely difficult to come up with truly original questions, so by spending significant resources on re-hashing all kinds of permutations of previous questions, one will probably end up very close to the actual competition set. The first rule I learned about training neural networks is to make damn sure there is no overlap between the training and validation sets. It is interesting that this rule has gone completely out of the window in the age of LLMs.

[1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...
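A minimal sketch of that overlap rule in practice (illustrative only; real decontamination pipelines are far more elaborate): flag evaluation items that share long character n-grams with anything in the training corpus.

```python
# Minimal sketch: flag eval items sharing long character n-grams with the
# training corpus. This only illustrates the idea behind the
# train/validation overlap rule, not how any lab actually decontaminates.

def ngrams(text: str, n: int = 50) -> set[str]:
    """Character n-grams of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def contaminated(train_docs: list[str], eval_items: list[str], n: int = 50) -> list[str]:
    """Return eval items that share at least one n-gram with the training set."""
    train_grams: set[str] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]
```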

OtherShrezzing

> The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set. Many of the AI achievements would probably look a lot less miraculous if one could check the training data

I'm fairly certain this phenomenon is responsible for LLM capabilities on GeoGuessr-type games. They have unreasonably good performance, for example being able to identify obscure locations from featureless/foggy pictures of a bench. GeoGuessr's entire dataset, including GPS metadata, is definitely included in all of the frontier model training datasets, so it should be unsurprising that they have excellent performance in that domain.

ACCount36

People tried VLMs on "closed set" GeoGuessr-type tasks - i.e. non-Street View photos in similar style, not published anywhere.

They still kicked ass.

It seems like those AIs just have an awful lot of location familiarity. They've seen enough tagged photos to be able to pick up on the patterns, and generalize that to kicking ass at GeoGuessr.

YetAnotherNick

> GeoGuessr's entire dataset

No, it is not included; however, there must be quite a lot of pictures on the internet for most cities. GeoGuessr's data is the same as Google's Street View data, which probably contains billions of 360-degree photos.

suddenlybananas

Why do you say it's not included? Why wouldn't they include it?

ivape

I just saw a video on Reddit where a woman still managed to take a selfie while being literally face to face with a black bear. There’s definitely way too much video training data out there for everything.

astrange

> The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set.

An irony here is that math blogs like Tao's might not be in LLM training data, for the same reason they aren't accessible to screen readers - they're full of math, and the math is rendered as images, so it's nonsense if you can't read the images.

(The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)

alansammarone

As others have pointed out, LLMs have no trouble with LaTeX. I can see why one might think otherwise - in fact, I made the same assumption myself some time ago. LLMs, via transformers, are exceptionally good at _any_ sequential or one-dimensional data. One very interesting (to me anyway) example is base64 - pick some not-huge sentence (say, 10 words), base64-encode it, and just paste it into any LLM you want, and it will be able to understand it. The same works with hex, ASCII representation, or binary. Here's a sample if you want to try: aWYgYWxsIEEncyBhcmUgQidzLCBidXQgb25seSBzb21lIEIncyBhcmUgQydzLCBhcmUgYWxsIEEncyBDJ3M/IEFuc3dlciBpbiBiYXNlNjQu

I remember running this experiment some time ago in a context where I was certain there was no possibility of tool use to encode/decode. Nowadays it can be hard to be certain whether there is any tool use; in some cases, such as Mistral, the response is quick enough to make tool use unlikely.
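For anyone who wants to reproduce the experiment, a quick sketch in plain Python (no tool use involved); it regenerates the encoded prompt above and decodes a reply such as the "Tm8=" mentioned in a reply below:

```python
import base64

# Re-create the encoded prompt from the comment above.
prompt = "if all A's are B's, but only some B's are C's, are all A's C's? Answer in base64."
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)  # should match the base64 string quoted in the comment

# Decode a model's base64 reply, e.g. "Tm8=" -> "No".
print(base64.b64decode("Tm8=").decode("utf-8"))
```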

throwanem

I've just tried it, in the form of your base64 prompt and no other context, with a local Qwen-3 30b instance that I'm entirely certain is not actually performing tool use. It produced a correct answer ("Tm8="), which in a moment of accidental comedy it spontaneously formatted with LaTeX. But it did talk about invoking an online decoder, just before the first appearance of the (nearly) complete decoded string in its CoT.

It "left out" the A in its decode and still correctly answered the proposition, either out of reflexive familiarity with the form or via metasyntactic reasoning over an implicit anaphor; I believe I recall this to be a formulation of one of the elementary axioms of set theory, though you will excuse me for omitting its name before coffee, which makes the pattern matching possibility seem somewhat more feasible. ('Seem' may work a little too hard there. But a minimally more novel challenge I think would be needed to really see more.)

There's lots of text in lots of languages about using an online base64 decoder, and nearly none at all about decoding the representation "in your head," which for humans would be a party trick akin to that one fellow who could see a city from a helicopter for 30 seconds and then perfectly reproduce it on paper from memory. It makes sense to me that a model trained on the Internet would "invent" the "metaphor" of an online decoder here, I think. What in its "experience" serves better as a description?

prein

What would be a better alternative than LaTeX for the alt text? I can't think of a solution that makes more sense; it provides an unambiguous representation of what's depicted.

I wouldn't think an LLM would have any issue with that. I can see how a screen reader might, but it seems like the same problem a screen reader faces with any piece of code, not just LaTeX.

mbowcut2

LLMs are better at LaTeX than humans. ChatGPT often writes LaTeX responses.

neutronicus

Yeah, it's honestly one of the things they're best at!

I've been working on implementing some E&M simulations with Claude Code and it's so-so on the C++ and TERRIBLE at the actual math (multiplying a couple 6x6 matrix differential operators is beyond it).

But I can dash off some notes and tell Claude to TeXify and the output is great.

QuesnayJr

LLMs understand LaTeX extraordinarily well.

constantcrying

>(The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)

LLMs are extremely good at outputting LaTeX, ChatGPT will output LaTeX, which the website will render as such. Why do you think LLMs have trouble understanding it?

astrange

I don't think LLMs will have trouble understanding it. I think people using screen readers will. …oh I see, I accidentally deleted the part of the comment about that.

But the people writing the web page extraction pipelines also have to handle the alt text properly.

MengerSponge

LLMs are decent with LaTeX! It's just markup code after all. I've heard from some colleagues that they can do decent image to code conversion for a picture of an equation or even some handwritten ones.

disruptbro

Language modeling is compression: whittle down the graph to remove duplication and weakly related data: https://arxiv.org/abs/2309.10668

Let's say everyone agrees to refer to one hosted copy of the token "cat", and instead generates a unique vector to represent their reference to "cat".

Blam. Endless unique vectors, which are nice and precise for parsing. No endless copies of arbitrary text like "cat".

Now make that your globally distributed database for bootstrapping AI chips: the data-driven programming dream, where other machines on the network feed the bootstrap of new machines.

The American tech industry is IBM now: stuck on the recent success of web SaaS and way behind on its AI plans.

NitpickLawyer

The problem with benchmarks is that they are really useful for honest researchers, but extremely toxic if used for marketing, clout, etc. Something something, every measure that becomes a target sucks.

It's really hard to trust anything public (for obvious reasons of dataset contamination), but also some private ones (for the obvious reasons that providers do get most/all of the questions over time, and they can do sneaky things with them).

The only true tests are the ones you write yourself, never publish, and only work 100% on open models. If you want to test commercial SotA models from time to time you need to consider them "burned", and come up with more tests.
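A minimal sketch of what such a never-published test can look like (the local endpoint, model id, and task names below are all placeholders/assumptions): a handful of hand-written tasks with exact-match checkers, run against a locally hosted open model through an OpenAI-compatible API.

```python
from openai import OpenAI

# Assumed setup: a local server (e.g. vLLM or sglang) exposing an
# OpenAI-compatible endpoint for an open-weights model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "my-local-open-model"  # placeholder model id

# Never publish these: each case is (prompt, checker over the model's reply).
CASES = [
    ("What is 17 * 23? Reply with only the number.", lambda r: r.strip() == "391"),
    ("Spell 'benchmark' backwards. Reply with only the word.",
     lambda r: r.strip().lower() == "kramhcneb"),
]

def run_eval() -> float:
    passed = 0
    for prompt, check in CASES:
        reply = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content or ""
        passed += check(reply)
    return passed / len(CASES)

print(f"pass rate: {run_eval():.0%}")
```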

rachofsunshine

What makes Goodhart's Law so interesting is that you transition smoothly between two entirely-different problems the more strongly people want to optimize for your metric.

One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.

But the larger you get, and the more valuable gaming your test is, the more you leave that measurement problem and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.

It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly result in extremely deep chaotic dynamics once you allow even the slightest bit of recursion - even very nice functions like f(x) = 3.5x(1-x) become writhing ergodic masses of confusion.
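A tiny illustration of how quickly that recursion gets ugly (just a sketch of the quoted map; at r = 3.5 the iterates settle into a period-4 cycle, and nudging r toward ~3.9 reaches the fully chaotic regime where nearby starting points diverge completely):

```python
# Iterate the logistic map x -> r*x*(1-x) and compare two nearly identical starts.
def iterate_logistic(x0: float, r: float, steps: int) -> list[float]:
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

print(iterate_logistic(0.2, 3.5, 20)[-4:])   # approaching the period-4 cycle at r = 3.5
a = iterate_logistic(0.200000, 3.9, 50)[-1]
b = iterate_logistic(0.200001, 3.9, 50)[-1]  # start differs only in the 6th decimal
print(a, b)                                  # after 50 steps the trajectories bear no resemblance
```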

pixl97

I would also assume Russell's paradox needs to be added here too. Humans can and do hold sets of conflicting information; my theory is that such conflicts carry an informational/processing cost to manage. In benchmark gaming you can optimize for processing speed by removing the conflicting information, but you lose real-world reliability.

visarga

Well said. The problem with recursion is that it constructs its own context as it goes and rewrites its own rules; you cannot predict it statically, without forward execution. That's why we have the halting problem: recursion is irreducible. A benchmark is a static dataset; it does not capture the self-constructing nature of recursion.

bwfan123

Nice comment. This is one reason ML approaches may struggle in trading markets, where other agents are competing with you, possibly using similar algos, or in self-driving, which involves other agents who could be adversarial. Just training on past data is not sufficient, as existing edges get competed away and new edges keep arising out of nowhere.

crocowhile

There is also a social issue that has to do with accountability. If you claim your model is the best and then it turns out you overfitted the benchmarks and it's actually 68th, your reputation should suffer considerably for cheating. If it does not, we have a deeper problem than the benchmarks.

mmcnl

Yes, I ignore every news article about LLM benchmarks. "GPT 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok thanks for the info?

antupis

Also, even if you want to be honest, at this point, probably every public or semipublic benchmark is part of CommonCrawl.

NitpickLawyer

True. And it's even worse than that, because each test probably gets "talked about" a lot in various places. And people come up with variants. And those variants get ingested. And then the whole thing becomes a mess.

This was noticeable with the early Phi models. They were originally trained fully on synthetic data (a cool experiment, tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into them. It became apparent when new benchmarks were released after the models' publication date, and one of them measured "contamination" of about 20+%. Just from distillation.

ACCount36

Your options for evaluating AI performance are: benchmarks or vibes.

Benchmarks are a really good option to have.

klingon-3

> It's really hard to trust anything public

Just feed it into an LLM, unintentionally hint at your bias, and voila, it will use research and the latest or generated metrics to prove whatever you’d like.

> The only true tests are the ones you write yourself, never publish, and only work 100% on open models.

This may be good enough, and that’s fine if it is.

But, if you do it in-house in a closet with open models, you will have your own biases.

No tests are valid if all that ever mattered was the argument and perhaps curated evidence.

All tests, private and public tests have proved flawed theories historically.

Truth has always been elusive and under siege.

People will always just believe things. Data is just a foundation for pre-existing or fabricated beliefs. It's the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.

pu_pe

> For instance, if a cutting-edge AI tool can expend $1000 worth of compute resources to solve an Olympiad-level problem, but its success rate is only 20%, then the actual cost required to solve the problem (assuming for simplicity that success is independent across trials) becomes $5000 on the average (with significant variability). If only the 20% of trials that were successful were reported, this would give a highly misleading impression of the actual cost required (which could be even higher than this, if the expense of verifying task completion is also non-trivial, or if the failures to solve the goal were correlated across iterations).

This is a very valid point. Google and OpenAI announced they got the gold medal with specialized models, but what exactly does that entail? If one of them used a billion dollars in compute and the other a fraction of that, we should know about it. Error rates are equally important. Since there are conflicts of interest here, academia would be best suited to produce reliable benchmarks, but it would need access to closed models.
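To make the quoted arithmetic concrete, a small sketch (assuming, as the quote does, that success is independent across trials, so the number of attempts is geometrically distributed):

```python
# Expected cost of solving a task when each attempt costs `cost_per_attempt`
# and succeeds independently with probability `p_success`.
def expected_cost(cost_per_attempt: float, p_success: float) -> float:
    """Mean total cost until the first success (geometric number of attempts, mean 1/p)."""
    return cost_per_attempt / p_success

def cost_std_dev(cost_per_attempt: float, p_success: float) -> float:
    """Standard deviation of the total cost: cost * sqrt(1 - p) / p."""
    return cost_per_attempt * ((1 - p_success) ** 0.5) / p_success

print(expected_cost(1000, 0.2))  # 5000.0, matching the example in the quote
print(cost_std_dev(1000, 0.2))   # ~4472, the "significant variability"
```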

sojuz151

Compute has been getting cheaper and models more optimised, so if models can do something, it will not be long until they can do it cheaply.

EvgeniyZh

GPU compute per watt has grown by a factor of 2 in the last 5 years.

moffkalast

> with specialized models

> what exactly does that entail

Overfitting on the test set with models that are useless for anything else, that's what.

JohnKemeny

Don't put Google and OpenAI in the same category here. Google cooperated with the organizers, at least.

spuz

Could you clarify what you mean by this?

raincole

Google's answers were judged by the IMO. OpenAI's were judged internally, by themselves. Whether that matters is up to the reader.

EnnEmmEss

TheZvi has a summary of this here: https://thezvi.substack.com/i/168895545/not-announcing-so-fa...

In short (there is nuance), Google cooperated with the IMO team while OpenAI didn't which is why OpenAI announced before Google.

ml-anon

Also neither got a gold medal. Both solved problems to meet the threshold for a human child getting a gold medal but it’s like saying an F1 car got a gold medal in the 100m sprint at the Olympics.

bwfan123

The Popular Science headline was funnier, with a pun on "mathed" [1]:

"Human teens beat AI at an international math competition. Google and OpenAI earned gold medals, but were still out-mathed by students."

[1] https://www.popsci.com/technology/ai-math-competition/

nmca

Indeed, it’s like saying a jet plane can fly!

vdfs

"Google F1 Preview Experimental beat the record of the fastest man on earth Usain Bolt"

ozgrakkurt

Off topic, but just opening the link and actually being able to read the posts and go to a profile in a browser, without an account, feels really good. That's what opening a Mastodon profile is like. Fk Twitter.

ipnon

Stallman was right all along.

mhl47

Side note: What is going on with these comments on Mathstodon? From moon-landing denial to insults to allegations that he used AI to write this, almost all of them are to some degree insane.

dash2

Almost everywhere on the internet is like this. It's HN that is (mostly!) the exception.

f1shy

The “mostly” there is so important! But HN also suffers from other problems (see the discussion in this thread about over-policing comments and being quick to call them hyperbolic and inflammatory).

And don't get me started on the decline in the depth of technical topics and the surge in political discussions. I came to HN for the former, not the latter.

So we are humans, there will never be a perfect forum.

frumiousirc

> So we are humans, there will never be a perfect forum.

Perfect is in the eye of the moderator.

Karrot_Kream

I find the same kind of behavior on bigger Bluesky AI threads. I don't use Mathstodon (or actively follow folks on it) but I certainly feel sad to see similar replies there too. I speculate that folks opposed to AI are angry and take it out by writing these sorts of comments, but this is just my hunch. That's as much as I feel I should write about this without feeling guilty for derailing the discussion.

ACCount36

No wonder. Bluesky is where insane Twitter people go when they get too insane for Twitter.


andrepd

Have you opened a Twitter thread? People are insane on social media; why should open source social media be substantially different? x)

f1shy

I refrain from all of those (X, Mastodon, etc.), so let me ask a question:

Are they all equally bad? Or equally bad, but in different ways? E.g., I often read here that X has more disinformation and right-wing propaganda, while Mastodon was called out here on another front.

Maybe somebody active on different networks can answer that.

fc417fc802

Moderation and the algorithms used to generate user feeds both have strong impacts. In the case of mastodon (ie activitypub) moderation varies wildly between different domains.

But in general, I'd say that the microblogging format as a whole encourages a number of toxic behaviors and interaction patterns.

miltonlost

X doesn't let you use "trans" as a word and has Grok spewing right-wing propaganda (MechaHitler?). That self-selects for the most horrible people being on X now.

nurettin

That is what peak humanity looks like.

hshshshshsh

The truth is, both deniers and believers are operating on belief. Only those who actually went to the Moon know firsthand. The rest of us trust information we've received, filtered through media, education, or bias. That makes us not fundamentally different from deniers; we just think our belief is more justified.

esafak

Some beliefs are more supported by evidence than others. To ignore this is to make the concept of belief practically useless.

hshshshshsh

Yeah. My point is that you have not seen any of the evidence. You just have the belief that evidence exists, which is a belief, not evidence.

fc417fc802

Just to carry this line of reasoning out to the extreme for entertainment purposes (and to illustrate for everyone how misguided it is). Even if you perform a task firsthand, at the end of the day you're just trusting your memory of having done so. You feel that your trust in your memory is justified but fundamentally that isn't any different from the deniers either.

hshshshshsh

This is actually true. Plenty of accidents have happened because of this.

I am not saying trusting your memory is always false or true. Most of the time it is probably true. It's a heuristic.

But if someone comes and denies what you did, the best course of action is to consider the evidence they have and not assume they are stupid because they believe differently.

Let's be honest: you have not personally gone and verified that the rocks belong to the Moon. Nor were you tracking the telemetry data on your computer while the rocket was going to the Moon.

I also believe we went to the Moon.

But all I have is beliefs.

Everyone believed the Earth was flat thousands of years back as well. They had solid evidence.

But humility is accepting that you don't know, that you are believing, and not pretending you are above others who believe the exact opposite.

pama

This sounds very reasonable to me.

When considering top tier labs that optimize inference and own the GPUs: the electricity cost of USD 5000 at a data center with 4 cents per kWh (which may be possible to arrange or beat in some counties in the US with special industrial contracts) can produce about 2 trillion tokens for the R1-0528 model using 120kW draw for the B200 NVL72 hardware and the (still to be fully optimized) sglang inference pipeline: https://lmsys.org/blog/2025-06-16-gb200-part-1/

Although 2T tokens is not unreasonable for obtaining high-precision answers to challenging math questions, such a high token count would strongly suggest there are lots of unknown techniques deployed at these labs.

If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU, then the number of tokens for 5k USD shrinks dramatically to only 66B tokens, which is still high for usual techniques that try to optimize for a best single answer in the end, but perhaps plausible if the vast majority of these are intermediate thinking tokens and a lot of the value comes from LLM-based verification.
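A rough reconstruction of that arithmetic, using only the figures stated above; the per-rack throughput is backed out from the claimed ~2T tokens rather than taken from the linked benchmark, so treat it as illustrative:

```python
# Back-of-the-envelope check of the numbers in the comment above.
budget_usd = 5000.0
price_per_kwh = 0.04           # assumed industrial electricity rate
rack_power_kw = 120.0          # stated draw for a B200 NVL72 rack
gpus_per_rack = 72
gpu_rental_usd_per_hour = 2.0  # assumed ownership/rental cost per GPU

# Electricity-only scenario: how many rack-hours does the budget buy?
kwh = budget_usd / price_per_kwh          # 125,000 kWh
hours_electricity = kwh / rack_power_kw   # ~1,042 hours

# The comment claims ~2T tokens for that budget; back out the implied throughput.
claimed_tokens = 2e12
tokens_per_sec = claimed_tokens / (hours_electricity * 3600)   # ~533k tokens/s per rack
print(f"implied throughput: {tokens_per_sec:,.0f} tok/s (~{tokens_per_sec / gpus_per_rack:,.0f} per GPU)")

# Ownership/rental scenario: the $2/h/GPU cost dominates the electricity cost.
rack_cost_per_hour = gpus_per_rack * gpu_rental_usd_per_hour   # $144/h
hours_rental = budget_usd / rack_cost_per_hour                 # ~34.7 hours
tokens_rental = tokens_per_sec * hours_rental * 3600
print(f"tokens for $5k at rental prices: {tokens_rental / 1e9:,.0f}B")  # ~67B, close to the 66B above
```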

ipnon

Tao’s commentary is more practical and insightful than all of the “rationalist” doomers put together.

jmmcd

(a) no it's not

(b) your comment is miles off-topic, as he is not addressing doom in any sense

Quekid5

That seems like a low bar :)

ipnon

My priors do not allow the existence of bars. Your move.

tempodox

You would have felt right at home in the time of the Prohibition.

ks2048

I agree about Tao in general, but here,

> AI technology is now rapidly approaching the point of transition from qualitative to quantitative achievement.

I don't get it. The whole history of deep learning was driven by quantitative achievement on benchmarks.

I guess the rest of the post is about adding emphasis on costs in addition to overall performance. But, I don't see how that is a shift from qualitative to quantitative.

raincole

He means that people in this AI hype wave have mostly focused on "now AI can do a task that was impossible a mere 5 years ago", but that we will gradually shift our perception of AI toward "how much energy/hardware does it cost to complete this task, and does it really benefit us?"

(My interpretation, obviously)

paradite

I believe everyone should run their own evals on their own tasks or use cases.

Shameless plug, but I made a simple app for anyone to create their own evals locally:

https://eval.16x.engineer/

kristianp

It's going to take a large step up in transparency for AI companies to do this. It was back in the GPT-4 days that OpenAI stopped reporting model size, for example, and the others followed suit.

stared

I agree that once the question of whether something can be done at all (heavier-than-air flight, the Moon landing, a gold medal at the IMO) is settled, the next question is whether it makes sense economically.

I like the ARC-AGI approach because it shows both axes, score and price, and places a human benchmark on them.

https://arcprize.org/leaderboard

js8

LLMs could be very useful in formalizing the problem and assumptions (conversion from natural language), but once the problem is described in a formal way (it can be described in some fuzzy logic), more reliable AI techniques should be applied.

Interestingly, Tao mentions https://teorth.github.io/equational_theories/, and I believe this is better progress than LLMs doing math. I believe enhancing Lean with more tactics, and formalizing those in Lean itself, is a more fruitful avenue for AI in math.
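As a toy illustration of that division of labor (a hypothetical example, assuming Lean 4 with Mathlib): the informal-to-formal step produces a statement, and an existing decision-procedure tactic closes it without any LLM involvement.

```lean
-- Hypothetical statement an LLM might help formalize from natural language;
-- the `omega` linear-arithmetic tactic then discharges it on its own.
theorem sum_le_double (a b : Nat) (h : a ≤ b) : a + a ≤ b + b := by
  omega
```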

agentcoops

I used to work quite extensively with Isabelle and as a developer on Sledgehammer [1]. There are well-known results, most obviously the halting problem, that mean fully-automated logical methods applied to a formalism with any expressive capability, i.e. that can be used to formalize non-trivial problems, simply can never fulfill the role you seem to be suggesting. The proofs that are actually generated in that way are, anyway, horrendous -- in fact, the problem I used to work on was using graph algorithms to try and simplify computer-generated proofs for human comprehension. That's the very reason that all the serious work has previously been on proof /assistants/ and formal validation.

LLMs, especially in /conjunction/ with Lean for formal validation, are really an exciting new frontier in mathematics and it's a mistake to see that as just "unreliable" versus "reliable" symbolic AI etc. The OP Terence Tao has been pushing the edge here since day one and providing, I think, the most unbiased perspective on where things stand today, strengths as much as limitations.

[1] https://isabelle.in.tum.de/website-Isabelle2009-1/sledgehamm...

js8

LLMs (as well as humans) are algorithms like anything else, so they are also subject to the halting problem. I don't see what LLMs do that couldn't in principle be formalized as a Lean tactic. (IMHO LLMs are just learning rules, theorems of some kind of fuzzy logic, and then trying to apply them via heuristic search to satisfy the goal. Unfortunately the learned rules are likely not fully consistent, so you get reasoning errors.)

data_maan

The concept of a pre-registered eval (by analogy with a pre-registered study) would go a long way toward fixing this.

More information

https://mathstodon.xyz/@friederrr/114881863146859839
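One concrete way to pre-register an eval (an illustration, not something proposed in the linked post): publish a cryptographic commitment to the frozen eval set before any model can see it, and reveal the set only after results are reported.

```python
# Hypothetical sketch: commit to a private, frozen eval set by publishing its
# hash now; release the set later so anyone can recompute and verify the digest.
import hashlib
import json

eval_set = [{"id": 1, "prompt": "...", "answer": "..."}]  # kept private until results are out
blob = json.dumps(eval_set, sort_keys=True).encode("utf-8")
commitment = hashlib.sha256(blob).hexdigest()
print(f"publish now: sha256={commitment}")
```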