
OpenAI claims Gold-medal performance at IMO 2025

mikert89

The cynicism/denial on HN about AI is exhausting. Half the comments are some weird form of explaining away the ever-increasing performance of these models.

I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable; all the actual educated takes are on X. It's almost like there was a talent drain.

uludag

Cynicism and denial are two very different things, and have very different causes and justifications. I personally don't deny that LLMs are very powerful and are capable of eliminating many jobs. At the same time I'm very cynical about the rollout and push for AI. I don't see it in any way as a push for a "better" society or towards some notion of progress, but rather as an enthusiastic effort to disempower employees, centralize power, expand surveillance, increase profits, etc.

kilna

AI is kerosene. A useful resource when applied with reason and compassion. Late stage capitalism is a dumpster full of garbage. AI in combination with late stage capitalism is a dumpster fire. Many, perhaps most people conflate the dumpster fire with "kerosene evil!"

sealeck

> I've been reading this website for probably 15 years, and it's never been this bad... all the actual educated takes are on X

Almost every technical comment on HN is wrong (see for example essentially all the discussion of Rust async, in which people keep making up silly claims that Rust maintainers then patiently attempt to explain are wrong).

The idea that the "educated" takes are on X though... that's crazy talk.

wrsh07

With regard to AI and LLMs, Twitter/X is actually the only place where all of the industry people are discussing them.

There are a bunch of great accounts to follow that are only really posting content to X.

Karpathy, nearcyan, kalomaze, all of the OpenAI researchers including the author of the link this discussion is about, many Anthropic researchers. It's such a meme that you see people discuss reading the Twitter thread + paper because the thread gives useful additional context.

HN still has great comment sections on maker-style posts and on network stuff, but I no longer enjoy the discussions wrt AI here. It's too hyperbolic.

mikert89

That people on here don't know a lot of the leading researchers only post on X is a tell in itself.

ptero

I see the same effect regarding macroeconomic discussions. Great content on X that is head and shoulders (says me) above discussions on other platforms.

breadsniffer

Yup! People here are great hackers but it’s almost like they have their head up their own ass when it comes to AI/ML.

Most of HN was very wrong about LLMs.

aerhardt

Too hyperbolic for, against, or either way?

adastra22

This is true of every forum and every topic. When you actually know something about the topic you realize 90% of the takes about it are garbage.

But in most other sites the statistic is 99%, so HN is still doing much better than average.

scellus

Not on AI; this is really a fringe environment of relatively uninformed commenters, compared to X. X has its craziness but you can curate your feeds by using lists. Here I can't choose who to follow.

And as said, the researchers themselves are on X; even Gary Marcus is there. ;)

halfmatthalfcat

The overconfidence/short-sightedness on HN about AI is exhausting. Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

Aurornis

> Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

I do not see that at all in this comment section.

There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.

kenjackson

I went through the thread and saw nothing that looked like this.

I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.

I worked on accelerating DNNs a little less than a decade ago, and had you shown me what we're seeing now with LLMs, I'd have said it was closer to 50 years out than 20 years out.

mikert89

It's very clearly a major breakthrough for humanity.

AtlasBarfed

Greatest breakthrough in compsci.

You mean the one that paves the way for ancient Egyptian slave worker economies?

Or totalitarian rule that 1984 couldn't imagine?

Or...... Worse?

The intermediate classes of society always relied on intelligence and competence to extract money from the powerful.

AI means those classes no longer have power.

halfmatthalfcat

You're missing the joke homie.

infecto

I don't typically find this to be true. There is a definite cynicism on HN, especially when it comes to OpenAI. You already know what you will see: low-quality garbage like "I remember when OpenAI was open", "remember when they used to publish research", "sama cannot be trusted". It's an endless barrage of garbage.

mikert89

It's honestly ruining this website; you can't even read the comment sections anymore.

halfmatthalfcat

Incredible how many HNers cannot see this comment for what it is.

blamestross

Nobody likes the idea that this is only "economically superior AI": not as good as humans, but a LOT cheaper.

The "It will just get better" is bubble baiting the investors. The tech companies learned from the past and they are riding and managing the bubble to extract maximum ROI before it pops.

The reality is that a lot of work done by humans can be replaced by an LLM, with lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.

The current generation of LLMs are enshittification accelerators, and that will have real effects.

mpalmer

Enthusiastically denouncing or promoting something is much, much easier and more rewarding in the short term for people who want to appear hip to their chosen in-group - or profit center.

And then, it's likewise easy to be a reactionary to the extremes of the other side.

The middle is a harder, more interesting place to be, and people who end up there aren't usually chasing money or power, but some approximation of the truth.

motoboi

Accepting OpenAI at face value is just the lazy stance.

Finding a critical perspective and trying to understand why it can be wrong is more fun. You just say "I was wrong" when proven wrong.

uh_uh

Come on, you must know OP's comment is not just about this particular announcement by OpenAI. There is a general anti-AI sentiment on this forum.

miguelacevedo

Basically this. Not sure why people here love to doubt AI progress as it clearly makes strides

riku_iki

Because per the corps' statements, AI is now top 0.1% of PhDs in math, coding, physics, law, medicine, etc. Yet when I try it myself for my work, it makes stupid mistakes, so I suspect the corps are very pushy about manipulating metrics/benchmarks.

gellybeans

Making an account just to point out how these comments are far more exhausting, because they don't engage with the subject matter. They are just agreeing with a headline and saying, "See?"

You say "explaining away the increasing performance" as though that were a good-faith representation of arguments made against LLMs, or even of this specific article. Questioning the self-congratulatory nature of these businesses is perfectly reasonable.

uh_uh

But don't you think this might be a case where there is both self-congratulation and actual progress?

softwaredoug

Probably because both sides have strong vested interests and it’s next to impossible to find a dispassionate point of view.

The pro-AI crowd, VCs, tech CEOs, etc. have a strong incentive to claim humans are obsolete. Many tech employees see threats to their jobs and want to pooh-pooh any way AI could be useful or competitive.

orbital-decay

That's a huge hyperbole. I can assure you many people find the entire thing genuinely fascinating, without having any vested interest and without buying the hype.

spacemadness

Sure but it’s still a gold rush with a lot of exaggeration pushed by tech executives to acquire investors. There’s a lot of greed and fear to go around. I think LLMs are fascinating and cool myself having grown up with Eliza and crappy expert systems, but am more interested in deep learning outcomes like Alphafold than general purpose LLMs. You don’t hear enough about non-LLM AI because of all the money riding on LLM based tech. It’s hard not to see the bad behavior that has arisen due to all the money being thrown about. So that is to say it makes sense there is some skepticism as you can’t take what these companies say at face value. It’d be nice to have a toned down discussion about what LLMs can and can’t do but there is a lot of obfuscation and hype. Also there is the conversation about what they should or shouldn’t be doing which is completely fair to talk about.

rvz

Or some can spot a euphoric bubble when they see it, with lots of participants who have over-invested in the 90% of these so-called AI startups that are not frontier labs.

yunwal

What does this have to do with the math Olympiad? Why would it frame your view of the accomplishment?

mikert89

Dude, we have computers reasoning in English to solve math problems. What are you even talking about?

chii

That's just another way to state that everybody is almost always self-serving when it comes to anything.

eab-

[Joseph Myers (IMO committee) in the Lean Zulip](https://leanprover.zulipchat.com/#narrow/channel/219941-Mach...):

> I talked to IMO Secretary General Ria van Huffel at the IMO 2025 closing party about the OpenAI announcement. While I can't speak for the Board or the IMO (and didn't get a chance to talk about this with IMO President Gregor Dolinar, and I doubt the Board are readily in a position to meet for the next few days while traveling home), Ria was happy for me to say that it was the general sense of the Jury and Coordinators at IMO 2025 that it's rude and inappropriate for AI developers to make announcements about their IMO performances too close to the IMO (such as before the closing party, in this case; the general coordinator view is that such announcements should wait at least a week after the closing ceremony), when the focus should be on the achievements of the actual human IMO contestants and reports from AIs serve to distract from that.

> I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts.

modeless

Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206

stingraycharles

My issue with all these citations is that it’s all OpenAI employees that make these claims.

I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.

do_not_redeem

A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.

https://matharena.ai/imo/

jsnell

I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model on that problem set.

It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.

csomar

The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust. As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.

emp17344

I still haven’t forgotten OpenAI’s FrontierMath debacle from December. If they really have some amazing math-solving model, give us more info than a vague twitter hype-post.

mrdependable

I like how they always say AI will advance science when they want to sell it to the public, but pump how it will replace workers when selling it to businesses. It’s like dangling a carrot while slowly putting a knife to our throats.

YeGoblynQueenne

How is a claim "clear evidence" of anything?

kelipso

Haha, if Musk made a claim five years ago, it would’ve been taken as clear evidence here. Now it’s other people I guess, hype never dies.

modeless

Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.

Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.

sealeck

> I judge people's trustworthiness individually and not solely by the organization they belong to

This is certainly a courageous viewpoint – I imagine this makes it very hard for you to engage in the modern world? Most of us are very bound by institutions we operate in!

emp17344

OpenAI have already shown us they aren’t trustworthy. Remember the FrontierMath debacle?

lossolo

> it’s also more efficient [than o1 or o3] with its thinking.

"So under his saturate response, he never loses. For her to win, must make him unable at some even -> would need Q_{even-1}>even, i.e. some a_j> sqrt2. but we just showed always a_j<=c< sqrt2. So she can never cause his loss. So against this fixed response of his, she never wins (outcomes: may be infinite or she may lose by sum if she picks badly; but no win). So she does NOT have winning strategy at λ=c. So at equality, neither player has winning strategy."[1]

Why use lot word when few word do trick?

1. https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro...

strangeloops85

I assume there was tool use in the fine tuning?

chairhairair

OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648

qoez

Remember that they've fired all whistleblowers that would admit to breaking the verbal agreement that they wouldn't train on the test data.

samat

Could not find it on the open web. Do you have clues to search for?

amelius

This is not a benchmark, really. It's an official test.

andrepd

And what were the methods? How was the evaluation? They could be making it all up for all we know!

Aurornis

The International Math Olympiad isn’t an AI benchmark.

It’s an annual human competition.

meroes

They didn’t actually compete.

ALLTaken

I think OpenAI participating is nothing but a publicity stunt and wholly unfair and disrespectful against Human participants. It should be allowed for AI models to participate, but it should not be ranked equally, nor should any engineers be put under duress of having to pull all-nighters. AI model performance should be shown T+2 days AFTER the contest! I wish that the real humans who worked hard could enjoy the attention, prizes, and respect they deserve!

Billion-dollar companies stealing the prize, prestige, time, and sleep of participants by brute-forcing their model through all the illegally scraped code on GitHub is a disgrace to humanity.

AI models should read the same materials to become proficient in coding, without having trillions of lines of code to ape through mindlessly. Otherwise the "AI" is no different from an elaborate Monte Carlo Tree Search (MCTS).

Yes, I know AI is quite advanced. I know that quite well, study the latest SOTA papers daily, and have developed my own models as well from the ground up, but despite all the advancements it's still far from being substantially better than MCTS (see: https://icml.cc/virtual/2025/poster/44177 and https://allenai.org/blog/autods )

EDIT, adding proof:

These are the results of the last competition they tried to win and LOST: https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...

(Looks like a pattern: OpenAI is scraping competitions to place itself in the spotlight and headlines.)

Aurornis

> I think OpenAI participating is nothing but a publicity stunt and wholly unfair and disrespectful against Human participants. It should be allowed for AI models to participate, but it should not be ranked equally,

OpenAI did not participate in the actual competition, nor were they taking spots away from humans. OpenAI just gave the problems to their AI under the same time limit and conditions (no external tool use).

> nor put any engineers under duress of having to pull all-nighters.

Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.

jsnell

As far as I can tell, OpenAI didn't participate, and isn't claiming they participated. Note the fairly precise phrasing of "gold medal-level performance": they claim to have shown performance sufficient for a gold, not that they won one.

esperent

> they claim to have shown performance sufficient for a gold

This sounds very like Ferrari claiming that their cars can drive fast enough to get gold in the Olympic games 100 meter sprint.

shawabawa3

Not at all

It's more like a chess engine claiming master level performance (back when that was an achievement)

aubanel

- AI competing is "wholly unfair"

- "[AI is] far away from being substantially being better than MCTs"

^ pick only one

yobbo

Running MCTS over algorithms is the part that might be considered unfair if used in competition with humans.

threatripper

Humans should be allowed to compete in groups of arbitrary size. This would also be a demonstration of excellent teamwork under time pressure.

pclmulqdq

In a general sense, cheating and losing are not mutually exclusive.

stingraycharles

Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).

Whether or not they're far away from being better than humans is up for debate, but the entire point of these types of benchmarks is to compare them to humans.

bluecalm

>>Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).

Yeah same way computers and robots should be able to win World Chess Championship, 100m dash and Wimbledon.

>>but the entire point of these types of benchmarks is to compare them to humans

The entire point of the competition is to fight against participants who are similar to you, have similar capabilities and go through similar struggles. If you want bot vs human competitions - great - organize it yourself instead of hijacking well established competitions out there.

chvid

I believe this company used to present its results and approach in academic papers, with enough detail that they could be reproduced by third parties.

Now it is just doing a bunch of tweets?

do_not_redeem

They're doing tweets because the results cannot be reproduced. https://matharena.ai/

falcor84

The tweets say that it's an unreleased model

ipsum2

Different model.

samrus

That's when they were a real research company. The last proper research they did was InstructGPT; everything since has been product development and following others. The reputation hit hasn't caught up with them because Sam Altman has built a whole career out of outrunning the reputation lag.

samat

This company used to be a nonprofit.

And many other things

z7

Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...
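For what it's worth, those numbers hang together arithmetically. A quick law-of-total-probability check (my arithmetic, not anything stated in the linked post):

    # Christiano's stated numbers: P(hard takeoff) = 0.30 overall,
    # P(IMO gold by 2025) = 0.08, and P(hard takeoff | gold) = 0.50.
    # What do they imply about P(hard takeoff | no gold)?
    p_gold = 0.08
    p_takeoff = 0.30
    p_takeoff_given_gold = 0.50
    p_takeoff_given_no_gold = (p_takeoff - p_gold * p_takeoff_given_gold) / (1 - p_gold)
    print(round(p_takeoff_given_no_gold, 3))  # ~0.283, i.e. barely below the 30% prior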

Voloskaya

Off topic, but am I the only one getting triggered every time I see a rationalist quantify their prediction of the future with single-digit precision? It's like their magic way of trying to get everyone to forget that they reached their conclusion in a completely hand-wavy way, just like every other human being. But instead of saying "low confidence" or "high confidence" like the rest of us normies, they will tell you they think there is a 16.27% chance, because they really really want you to be aware that they know Bayes' theorem.

sigmoid10

While I usually enjoy seeing these discussions, I think they are really pushing the usefulness of Bayesian statistics. If one dude says the chance of an outcome is 8% and another says it's 16%, and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few % higher had a better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.
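To put rough numbers on that (my arithmetic, treating each stated percentage as a simple forecast of the same binary event): the event occurring favors the 16% forecaster over the 8% one by a likelihood ratio of only 2, while a 90% forecaster would be favored over the 8% one by more than 11.

    # Likelihood ratio (Bayes factor) for forecaster A over forecaster B,
    # given that the forecast event did occur. Illustrative only.
    def bayes_factor(p_a: float, p_b: float) -> float:
        return p_a / p_b

    print(bayes_factor(0.16, 0.08))  # 2.0   -> weak evidence for the 16% forecaster
    print(bayes_factor(0.90, 0.08))  # 11.25 -> much stronger evidence for a 90% forecaster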

grillitoazul

From a mathematical point of view there are two factors: (1) Initial prior capability of prediction from the human agents and (2) Acceleration in the predicted event. Now we examine the result under such a model and conclude that:

The greater the prior predictive power of the human agents, the greater the a posteriori acceleration of progress in LLMs (math capability).

Here we are supposing that the increase in training data is not the main explanatory factor.

This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information.

grillitoazul

Another take at a sound interpretation:

(1) Bad prior prediction capability of humans implies that the result does not provide any information.

(2) Good prior prediction capability of humans implies that there is acceleration in the math capabilities of LLMs.

zeroonetwothree

A 16% or even 8% event happening is quite common so really it tells us nothing and doesn’t mean either one was pretty wrong.

exegeist

Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

We may certainly hope Eliezer's other predictions don't prove so well-calibrated.

rafaelero

Gary Marcus is so systematically and overconfidently wrong that I wonder why we keep talking about this clown.

qoez

People just give attention to people making surprising, bold, counter-narrative predictions but don't give them any attention when they're wrong.

causal

These numbers feel kind of meaningless without any work showing how he got to 16%

dcre

I do think Gary Marcus says a lot of wrong stuff about LLMs but I don’t see anything too egregious in that post. He’s just describing the results they got a few months ago.

m3kw9

He definitely cannot use the original arguments from when ChatGPT arrived; he's a perennial goalpost shifter.

shuckles

My understanding is that Eliezer more or less thinks it's over for humans.

andrepd

Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?

Voloskaya

> why shouldn't I assume they're pulled from thin air?

You definitely should assume they are. They are rationalists; the modus operandi is to pull stuff out of thin air and slap a single-digit-precision percentage prediction on it to make it seem grounded in science and well thought out.

Maxious

ask chatgpt

demirbey05

Google also joined the IMO and got a gold medal.

https://x.com/natolambert/status/1946569475396120653

OAI announced early; we will probably hear an announcement from Google soon.

gniv

From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams nobody solved it.

gus_massa

In the IMO, the idea is that the first day you get P1, P2 and P3, and the second day you get P4, P5 and P6. Usually, ordered by difficulty, they are P1, P4, P2, P5, P3, P6. So, usually P1 is "easy" and P6 is very hard. At least that is the intended order, but sometimes reality disagrees.

Edit: Fixed P4 -> P3. Thanks.

masterjack

In this case P6 was unusually hard and P3 was unusually easy https://sugaku.net/content/imo-2025-problems/

thundergolfer

You have P4 twice in there; the latter should be P3.

demirbey05

I think someone from the Canada team solved it, but among all contestants it's very few.

johnecheck

Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
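For concreteness, here is a minimal sketch of the best-of-n setup being speculated about above; generate_candidate() and score_candidate() are hypothetical placeholders, and nothing below reflects OpenAI's actual, undisclosed method:

    # Toy best-of-n selection: sample many attempts, keep the one a verifier
    # scores highest. The interesting question is whether the scoring step
    # knows the official solution (uninteresting) or not (meaningful result).
    import random

    def generate_candidate(problem: str) -> str:
        # stand-in for sampling one proof attempt from a model
        return f"proof attempt {random.random():.3f} for: {problem}"

    def score_candidate(candidate: str) -> float:
        # stand-in for a verifier/grader model
        return random.random()

    def best_of_n(problem: str, n: int = 10_000) -> str:
        candidates = (generate_candidate(problem) for _ in range(n))
        return max(candidates, key=score_candidate)

    print(best_of_n("IMO 2025 Problem 1", n=100))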

fnordpiglet

Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breathtaking science fiction a few years ago - now it's not exciting? Regardless of the tools for verification or even solvers - why are the goalposts moving so fast? There is no bonus for "purity of essence" and using only neural networks. We live in an era where it's hard to tell if machines are thinking or not, which ever since the first computing machines was seen as the ultimate achievement. Now we pooh-pooh the results of each iteration - which unfold month over month, not decade over decade.

You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques used. In fact - you should be excited if we’ve started to break out of the limitations of forcing NN to be load bearing in literally everything. That’s a sign of maturing technology not of limitations.

YeGoblynQueenne

>> Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breathtaking science fiction a few years ago - now it's not exciting?

Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.

falcor84

I don't get it, how do you "big data cheat" an AI into solving previously unencountered problems? Wouldn't that just be engineering?

parasubvert

I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.

Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.

Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.

If you're not familiar with System 1 / System 2, it's googlable.

lordnacho

> These models cannot reason

Not trying to be a smarty pants here, but what do we mean by "reason"?

Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.

It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.

It gives me feedback like "lock free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago, it will see that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.

What is not reasoning about this? Last year at this time, if I looked at my code with a two hour delta, and someone had pushed edits that were able to compile, with real improvements, I would not have any doubt that there was a reasoning, intelligent person who had spent years learning how this worked.

Is it pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?

I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

logicchains

>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result

This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.

Davidzheng

I don't think it's much less exciting if they ran it 10000 times in parallel. It implies an ability to discern when the proof is correct and rigorous (which o3 can't do consistently), and also means that outputting the full proof is within its capabilities, even if rarely.

FeepingCreature

The whole point of RL is if you can get it to work 0.01% of the time you can get it to work 100% of the time.

lcnPylGDnU4H9OF

> what tools were used and how the model used them

According to the twitter thread, the model was not given access to tools.

constantcrying

>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.

johnecheck

The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.

diggan

> If it didn't

We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.

strangeloops85

It’s interesting how hard and widespread a push they’re making in advertising this - at this particular moment, when there are rumors of more high level recruitment attempts / successes by Zuck. OpenAI is certainly a master at trying to drive narratives. (Independent of the actual significance / advance here). Sorry, there are too many hundreds of billions of dollars involved to not be a bit cautious and wary of claims being pushed this hard.