
FrontierMath was funded by OpenAI

104 comments · January 19, 2025

agnosticmantis

“… we have a verbal agreement that these materials will not be used in model training”

Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is a verbal agreement? And you didn’t disclose this?

At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.

Now we see that in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.

[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]

jerpint

You can still game a test set without training on it; that’s why you usually have both a validation set and a test set that you ideally seldom use. Routinely running evaluations on the test set lets the humans in the loop overfit to the data.
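
As a minimal illustration of that discipline (a sketch with toy data and scikit-learn, not anything from the thread): all iteration happens against the validation split, and the test split is scored exactly once at the end.

  # Sketch: tune against the validation set, touch the test set once.
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=2000, random_state=0)
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

  # Hyperparameter search looks only at the validation score.
  best = max(
      (LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train) for c in (0.01, 0.1, 1.0, 10.0)),
      key=lambda m: m.score(X_val, y_val),
  )

  # The test set is evaluated once, at the very end, and that is the number you report.
  print("test accuracy:", best.score(X_test, y_test))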

asadotzler

OpenAI doesn't respect copyright, so why would they let a verbal agreement get in the way of billion$?

Rebuff5007

Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

pseudo0

Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. Eg. https://www.reuters.com/legal/litigation/openai-gets-partial...

Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.

In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. Eg. Reddit locking down API access and selling their data to Google.

ThrowawayR2

The FSF funded some white papers a while ago on CoPilot: https://www.fsf.org/news/publication-of-the-fsf-funded-white.... Take a look at the analysis by two academics versed in law at https://www.fsf.org/licensing/copilot/copyright-implications... starting with §II.B that explains why it might be legal.

Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...) but then again he studied CS, not law. Nor has the FSF attempted AFAIK to file any suits even though they likely would have if it were an open and shut case.

Filligree

A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.

However, while it isn't fully settled yet, at the moment it does not appear to be the case.

AdieuToLogic

> Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

"Move fast and break things."[0]

Another way to phrase this is:

  Move fast enough while breaking things and regulations
  can never catch up.
0 - https://quotes.guide/mark-zuckerberg/quote/move-fast-and-bre...

jcranmer

The short answer is that there are actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since we've had the big generative AI blow-up, fueled by entities with deep pockets (i.e., you can actually profit off of the lawsuit), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.

And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted content for training, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of training for AI being found fair is actually quite slim.

alphan0n

Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.

As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.

If I coax your copyrighted work out of my phone’s keyboard suggestion engine letter by letter, and publish it, it’s still me infringing on your copyright, not Apple.

If I make a copy of your clip art in Illustrator, is Adobe responsible? Etc.

Even if (as I’ve seen argued ad nauseam) a model was trained on copyrighted works hosted on a piracy website, the copyright holder’s claim would be against the source of the infringing distribution, not the people who read the material.

Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?

marxisttemp

“There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.”

teleforce

>perhaps they were routing API calls to human workers

Honest question, did they?

charlieyu1

Why would they use the materials in model training? It would defeat the purpose of having a benchmark set.

wokwokwok

If you’re a research lab then yes.

If you’re a for-profit company trying to raise funding and fend off skepticism that your models really aren’t that much better than anyone else’s, then…

It would be dishonest, but as long as no one found out until after you closed your funding round, there’s plenty of reason you might do this.

It comes down to caring about benchmarks and integrity or caring about piles of money.

Judge for yourself which one they chose.

Perhaps they didn’t train on it.

Who knows?

It’s fair to be skeptical though, under the circumstances.

cma

OpenAI's benchmark results looking like Musk's Path of Exile character..

echelon

This has me curious about ARC-AGI.

Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples and then quickly mechanical turking a training set, fine tuning their model, then proceeding with the rest of the evaluation?

Are there other tricks they could have pulled?

It feels like unless a model is being deployed to an impartial evaluator's completely air gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.

trott

> This has me curious about ARC-AGI

In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.

> mechanical turking a training set, fine tuning their model

You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then you can train on that. It sounds like "pulling yourself up by your bootstraps", but it isn't. An approach to do this has been published, and it seems to scale very well with the amount of such generated training data (they won the 1st paper award).
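
The shape of that loop, as a rough sketch (the generator below is a placeholder; the actual published method isn't described in this thread):

  # Sketch of LLM-driven augmentation for an ARC-style training set.
  # generate_similar_task stands in for an LLM prompted with official tasks;
  # here it just perturbs a seed task so the snippet runs on its own.
  import json
  import random

  def generate_similar_task(seed_task: dict) -> dict:
      new_task = json.loads(json.dumps(seed_task))  # deep copy
      random.shuffle(new_task["train"])             # placeholder "variation"
      return new_task

  def augment(official_tasks: list, n_new: int) -> list:
      synthetic = [generate_similar_task(random.choice(official_tasks)) for _ in range(n_new)]
      return official_tasks + synthetic  # fine-tune on this enlarged set

  seed_tasks = [{"train": [{"input": [[0, 1]], "output": [[1, 0]]}], "test": []}]
  print(len(augment(seed_tasks, n_new=100)))  # 101 tasks to train on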

WiSaGaN

In their benchmark, they have a "tuned" tag attached to their o3 result. I guess we need them to tell us exactly what that means before we can gauge it.

riku_iki

> OpenAI to have gamed ARC-AGI by seeing the first few examples

Not just a few examples. o3 was evaluated on the "semi-private" test set, which had previously been used for evaluating OpenAI models, so OpenAI already had access to it for a long time.

lolinder

A co-founder of Epoch left a note in the comments:

> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.

And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could technically comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.
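
For concreteness, here is a sketch of what that kind of targeted data selection could look like (hypothetical, not something OpenAI is known to have done; the embedding function is a stand-in for any text encoder):

  # Sketch: bias a training corpus toward documents that resemble known
  # benchmark questions, without ever including the questions themselves.
  import numpy as np

  def embed(texts):
      # Placeholder character-count "embedding" so the sketch runs;
      # a real pipeline would use a learned text encoder.
      vecs = np.zeros((len(texts), 256))
      for i, t in enumerate(texts):
          for ch in t:
              vecs[i, ord(ch) % 256] += 1.0
      return vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9, None)

  def select_similar(corpus, benchmark_questions, top_k):
      sims = embed(corpus) @ embed(benchmark_questions).T  # cosine similarities
      scores = sims.max(axis=1)                            # closeness to nearest question
      keep = np.argsort(scores)[::-1][:top_k]
      return [corpus[i] for i in keep]                     # train on these, not the questions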

aithrowawaycomm

What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set, even though elsewhere Epoch AI strongly implied this already existed: https://xcancel.com/ElliotGlazer/status/1880809468616950187

It seems to me that o3's 25% benchmark score is 100% data contamination.

EagnaIonat

> What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,

There is nothing suspicious about this and the wording seems to be incorrect.

A hold-out set is a portion of the overall data that is used to test a model; the model is just not trained on it. Model developers normally have full access to it.

There is nothing inherently wrong with training on a full or partial hold-out set. It just means you have done a different split to train again.

The confusion I see here is that people are equating a hold-out set with a blind set: a set of test data that the model developers (and the model) cannot see.

Even so, blind sets can also go stale after a few runs, and there is nothing wrong with ingesting a stale blind set, as long as you have a new blind set to run against.

Trying to game blind set tests is nothing new and it gets very quickly found out.

What I took from the original article is that the blind set is likely unbalanced, and that the model answered more of the easier questions than the hard ones.

cma

> I just saw Sam Altman speak at YCNYC and I was impressed. I have never actually met him or heard him speak before Monday, but one of his stories really stuck out and went something like this:

> "We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.

> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.

> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract."

> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.

https://news.ycombinator.com/item?id=3048944

AyyEye

Nothing says genuine like lying to get a contract.

teaearlgraycold

This was my assumption all along.

sillysaurusx

The questions are designed so that such training data is extremely limited. Tao said it was around half a dozen papers at most, sometimes. That’s not really enough to overfit on without causing other problems.

lolinder

> That’s not really enough to overfit on without causing other problems.

"Causing other problems" is exactly what I'm worried about. I would not put it past OpenAI to deliberately overfit on a set of benchmarks in order to keep up the illusion that they're still progressing at the rate that the hype has come to expect, then keep the very-dangerous model under wraps for a while to avoid having to explain why it doesn't act as smart as they claimed. We still don't have access to this model (because, as with everything since GPT-2, it's "too dangerous"), so we have no way of independently verifying its utility, which means they have a window where they can claim anything they want. If they release a weaker model than claimed it can always be attributed to guardrails put in place after safety testing confirmed it was dangerous.

We'll see when the model actually becomes available, but in the meantime it's reasonable to guess that it's overfitted.

ripped_britches

Do people actually think OpenAI is gaming benchmarks?

I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.

What works for enterprise is very different from “does it beat this benchmark”.

No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.

In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.

I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.

Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group.

mlsu

Gaming benchmarks has a lot of utility for openAI whether their product works or not.

Many people compare models based on benchmarks. So if OpenAI can appear better than Anthropic, Google, or Meta by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is very, very tricky business these days.

In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.

On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?

This is a company that is shedding its coat of ethics and scientific rigor so as to be as unencumbered as possible in its footrace to the dollar.

ripped_britches

I think “getting beat handily” is an HN bubble concept. It depends on what you’re using it for, but I personally prefer 4o for coding. In enterprise usage, I think 4o is smoking Sonnet 3.5, but that’s just my perception from the folks I talk to.

hatefulmoron

I don't think that's true, you'll get the same sentiment ("Sonnet 3.5 is much better than GPT4/GPT4o [for coding]") pretty uniformly across Reddit/HN/Lobsters. I would strongly agree with it in my own testing, although o1 might be much better (I'm too poor to give it a fair shake.)

> In enterprise usage, I think 4o is smoking Sonnet 3.5

True. I'm not sure how many enterprise solutions have given their users an opportunity to test Claude vs. GPT. Most people just use whatever LLM API their software integrates.

raincole

> On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?

I do use Sonnet 3.5 personally, but this "beat handily" doesn't show on LLM arena. Does OpenAI game that too?

bugglebeetle

I used to think this, but using o1 quite a bit lately has convinced me otherwise. It’s been 1-shotting the fairly non-trivial coding problems I throw at it and is good about outputting large, complete code blocks. By contrast, Claude immediately starts nagging you about hitting usage limits after a few back-and-forths, and has some kind of hack in place to start abbreviating code when conversations get too long, even when explicitly instructed to do otherwise. I would imagine that Anthropic can produce a good test-time compute model as well, but until they have something publicly available, OpenAI has stolen back the lead.

jatins

> Do people actually think OpenAI is gaming benchmarks?

I was blown away by the ChatGPT release and have generally admired OpenAI; however, I wouldn't put it past them.

At this point their entire marketing strategy seems to be vague posting on X/Twitter and hyping the models so that investors always feel there is something just around the corner.

And I don't think they need to do that. Most investors will be throwing money at them either way, but maybe when you are looking to raise _billions_ that's not enough.

331c8c71

Well, I certainly won't object if OpenAI's marketing were based on testimonials from their fanboy customers instead of rigged benchmark scores %)

Your flagrant disregard for ethics and focus on utilitarian aspects is so extreme that only a few people would agree with you, in my view.

jsheard

Why do people keep taking OpenAI's marketing spin at face value? This keeps happening, like when they neglected to mention that their most impressive Sora demo involved extensive manual editing/cleanup work because the studio couldn't get Sora to generate what they wanted.

https://news.ycombinator.com/item?id=40359425

th1243127

It might be because (very few!) mathematicians like Terence Tao make positive remarks. I think these mathematicians should be very careful to use reproducible and controlled setups that by their nature cannot take place on GPUs in the Azure cloud.

I have nothing against scientists promoting the Coq Proof Assistant. But that's open source, can be run at home and is fully reproducible.

aithrowawaycomm

Keep in mind that those mathematicians were kept in the dark about the funding: it is incredibly unethical to invite a coauthor onto your paper and not tell them where the money came from.

It's just incredibly scummy behavior: I imagine some of those mathematicians would have declined the collaboration if the funding were transparent. More so than data contamination, this makes me deeply mistrustful of Epoch AI.

Vecr

Wait, I think I somehow knew Epoch AI was getting money from OpenAI. I'm not sure how, and I didn't connect any of the facts together to think of this problem in advance.

refulgentis

I can't parse any of this; can you explain it to a noob? I get lost immediately: funding, coauthor, etc. The only interpretation I've come to is that I've missed a scandal involving payola, Terence Tao, and keeping coauthors off papers.

refulgentis

Because the models have continually matched the quality they claim.

E.g., look how much work "very few" has to do in the sibling comment. It's like saying "very few physicists [Einstein/Feynman/Witten]".

It's conveniently impossible to falsify the implied inverse of "very few", i.e. that the vast majority say negative things.

You have to go through an incredible level of mental gymnastics, involving many months of gated decisions where the route chosen involved "gee, I know this is susceptible to confirmation bias, but...", to end up wondering why people think the models are real if OpenAI has access to data that includes some set of questions.

diggan

> Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.

Not sure "integrity of the benchmarks" should even be something that you negotiate over. What's the value of the benchmark if the results cannot be trusted because of undisclosed relationships and sharing of data? Why would they be restricted from disclosing things you would normally disclose, and how does that not raise all sorts of warning flags when it's even proposed?

aunty_helen

This feels like a done deal. This benchmark should be discarded.

optimalsolver

>OpenAI has data access to much but not all of the dataset

Their head mathematician says they have the full dataset, except a holdout set which they're currently developing (i.e. doesn't exist yet):

https://www.reddit.com/r/singularity/comments/1i4n0r5/commen...

zarzavat

OpenAI played themselves here. Now nobody is going to take any of their results on this benchmark seriously, ever again. That o3 result has just disappeared in a poof of smoke. If they had blinded themselves properly then that wouldn't be the case.

Whereas other AI companies now have the opportunity to be first to get a significant result on FrontierMath.

colonial

I'd be surprised if any of their in-house benchmark results are taken seriously after this. As an extremely rough estimate, FrontierMath cost five to six figures to assemble [1] - so from an outside view, they clearly have no qualms with turning cash into quasi-guaranteed benchmark results.

[1]: https://epoch.ai/math-problems/submit-problem - the benchmark comprises "hundreds" of questions, so at the absolute lowest it cost 300 * 200 = 60,000 dollars.

eksu

This risk could be mitigated by publishing the test.

bogtog

A lot of the comments suggest some kind of deliberate cheating on the benchmark. However, even without intentionally trying to game it, if anybody can repeatedly take the same test, then they'll be nudged toward overfitting/p-hacking.

For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, it could be a genuine small improvement, or it may be a mix of a genuine boost along with some fortunate noise. An effect may be small enough that researchers would need to rely on their gut to interpret it. Researchers may jump on noise while believing they have discovered true optimizations. Enough of these types of nudges, and some serious benchmark gains can materialize.

(Hopefully my comment isn't entirely misguided, I don't know how they actually do testing or how often they probe their test set)
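
A toy simulation of that effect (illustrative numbers only): a model whose true quality never changes, scored repeatedly on the same test set, with every apparent "improvement" kept.

  # A model with a fixed true accuracy of 70% is scored on the same
  # 500-question test set after each of 50 tweaks. Keeping a tweak whenever
  # the measured score goes up inflates the best reported score on noise alone.
  import random

  random.seed(0)
  TRUE_ACC, N_QUESTIONS, N_TWEAKS = 0.70, 500, 50

  def measure():
      return sum(random.random() < TRUE_ACC for _ in range(N_QUESTIONS)) / N_QUESTIONS

  best = measure()
  for _ in range(N_TWEAKS):
      score = measure()
      if score > best:  # "the tweak helped" (really just favorable noise)
          best = score

  print(f"true accuracy: {TRUE_ACC:.0%}, best reported: {best:.1%}")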

madars

I cringe every time I see "my IQ increased by X points after doing Y" posts on Twitter. Yes, you had a practice run on Raven's progressive matrices a month ago; that helped, these have a limited question bank, and the effect of Y is marginal. That said, test taking is obviously a skill (separate from background knowledge and both general and domain-specific ability) and should be trained if you expect to have life-altering events based on tests (i.e., do an LSAT course if you want to go to law school). Conversely, it shouldn't be trained if you think it will limit you through superstition ("I had a score of X, thus I can only perform around the level of X + fudge factor"). For an LLM company, a good test score is a valuation-altering event!

MattDaEskimo

There's something gross about OpenAI constantly misleading the public.

This maneuver by their CEO will destroy FrontierMath's and Epoch AI's reputations.

cbracketdash

Reminds me of the following proverb:

"The integrity of the upright guides them, but the unfaithful are destroyed by their duplicity."

(Proverbs 11:3)

lionkor

People on here were openly mocking me when I pointed out that you can't be sure LLMs (or any AIs) are actually smart unless you CAN PROVE that the question you're asking isn't in the training set (or adjacent to it, like in this case).

So with this in mind, let me repeat: unless you know that the question AND/OR answer are not in the training set or adjacent to it, do not claim that the AI or similar black box is smart.

pcmoore

I ran a test yesterday on ChatGPT and Copilot, first asking if it knew of a specific paper, which it confirmed, and then asking it to derive simple results from that paper, which it was completely incapable of doing. I know this paper is not widely referenced (i.e. few known results in the public domain), but it has been available for over 15 years, with publicly accessible code written by humans. The training set was so sparse that it had no ability to "understand" or even regurgitate anything past the summary text, which it listed almost verbatim.

Vecr

It is known that current models have terrible sample efficiency. I've been told that it's better than I thought it was, but it still isn't good.

sitkack

This all smells like the OpenAI CEO's MO. Stupid drama for stupid reasons.

padolsey

Many of these evals are quite easy to game. Often the actual evaluation part of benchmarking is left up to a good-faith actor, which was usually reasonable in academic settings less polluted by capital. AI labs, however, are disincentivized from doing a thorough or impartial job, so IMO we should never take their word for it. To verify, we need to be able to run these evals ourselves; this is only sometimes possible, since even when the datasets are public, the exact mechanisms of evaluation are not. In the long run, to be completely resilient to gaming via training, we probably need to follow the lead of other fields and have third-party, non-profit, accredited (!!) evaluators whose entire premise is to evaluate, red-team, and generally keep AI safe and competent.

moi2388

“… we have a verbal agreement that these materials will not be used in model training”

What about model testing before releasing it?

wujerry2000

My takeaways:

(1) Companies will probably increasingly invest in building their own evals for their own use cases, because it's becoming clear that public (and allegedly private) benchmarks have misaligned incentives, with labs sponsoring or cheating on them.

(2) Those evals will probably be proprietary "IP", guarded as closely as the code or research itself.

(3) Conversely, public benchmarks are exhausted, and SOMEONE has to invest in funding more frontier benchmarks. So this will probably continue.