Gemini with Deep Think achieves gold-medal standard at the IMO
177 comments · July 21, 2025 · modeless
rfurmani
Sounds like it did not:
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit
simonw
I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model - they gave it the exact problem text issued by the IMO.
It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.
jonahx
As a side question, do you think using tools like Lean will become a staple of these "deep reasoning" LLM flavors?
It seems that LLMs excel (relative to other paradigms) in the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.
modeless
Yes, that quote is contained in my comment. But I don't think it unambiguously excludes tool use in the internal chain of thought.
I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.
KoolKat23
End to end in natural language would imply no tool use, I'd imagine. Unless it called another tool that converted it, but that would be a real stretch (smoke and mirrors).
getnormality
We're told that formal verification tools like Lean are not used to solve the actual IMO problems, but are they used in training the model to solve the problems?
We know from Google's 2024 IMO work that they have a way to translate natural language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.
Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.
In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.
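As a rough sketch of what that denser reward signal might look like (purely illustrative; `llm`, `autoformalize`, and `lean_check` are hypothetical stand-ins, not Google's actual pipeline):

```python
# Hypothetical sketch: dense RLVR rewards via autoformalization.
# `llm`, `autoformalize`, and `lean_check` are stand-ins, not a real API.

def proof_reward(problem: str, nl_proof: str) -> float:
    """Score a natural-language proof by formally checking each step."""
    steps = [s for s in nl_proof.split("\n\n") if s.strip()]  # crude segmentation
    if not steps:
        return 0.0
    verified = sum(
        1 for step in steps
        if lean_check(autoformalize(problem, step))  # NL step -> Lean -> verify
    )
    return verified / len(steps)  # partial credit makes the signal dense

def rl_step(problem: str) -> None:
    nl_proof = llm.sample(problem)            # policy samples a candidate proof
    reward = proof_reward(problem, nl_proof)  # verifier-derived reward
    llm.update(problem, nl_proof, reward)     # e.g. a policy-gradient update
```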
modeless
It seems very likely from the description in the link that formal verification tools for mathematical proofs were used in part of the RL training for this model. On the other hand, OpenAI claims "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling." Which might suggest that they don't specifically use e.g. Lean in their training process. But it's not explicitly stated. All we can really do is speculate unless they publish more detail.
getnormality
The OpenAI proofs are so brutally, inhumanly spartan in their language that I can't imagine how the AI came up with them, except by RLVR against some crudely translated formal language.
jay_kyburz
I wonder if "no tool use, no internet access" means it can run without Google infrastructure, and offline, meaning it could be deployed locally for people who need that.
daxfohl
I'd also be curious as to why not use Lean. Is it that using Lean at this point makes the problems too easy to brute-force? Or does it just get in the way?
wbhart
Lean is an interactive prover, not an automated prover. Last year a lot of human effort was required to formalise the problems in Lean before the machines could get to work. This year you get natural language input and output, and it's much faster.
The advantage of Lean is that the system checks the solutions, so hallucination is impossible. Of course, one still relies on the problems and solutions being translated to natural language correctly.
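For a toy illustration of that checking (a minimal Lean 4 sketch, not anything from the systems under discussion):

```lean
-- Lean only accepts proofs that type-check, so a "hallucinated" proof
-- is rejected at compile time rather than slipping through.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Something false simply fails to elaborate:
-- theorem bogus (a : Nat) : a + 1 = a := rfl   -- error: type mismatch
```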
Some people prefer difficult to read formally checked solutions over informal but readable solutions. The two approaches are just solving different problems.
But there is another important reason to want to do this reliably in natural language: you can't use Lean for other domains (with a few limited exceptions). They want to train their RL pipelines for general intelligence and make them reliable for long-horizon problems. If a tool is needed as a crutch, then it more or less demonstrates that LLMs will not be enough in any domain, and we'll have to wait for traditional AI to catch up for every domain.
daxfohl
Oh, I didn't realize that last year the problem formalization was a human effort; I assumed the provers themselves took the problem and created the formalization. Is this step actually harder to automate than solving the problem once it's formalized?
Anyway, mainly I was curious whether using an interactive prover like Lean would have provided any advantage, or whether that is no longer really the case. My initial take would be that, yes, it should provide a huge advantage. Like in chess and Go, it'd allow it to look algorithmically through a huge search space and check which approaches get it closer to a resolution, where the AI is "only" responsible for determining which approaches to try.
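To make that concrete, here is a minimal sketch of such a search loop, where `llm_propose`, `llm_estimate_distance`, and `lean_apply` are hypothetical stand-ins for the model and the checker:

```python
# Hypothetical sketch: best-first search over proof states, chess-engine style.
# The LLM proposes tactics; a formal checker prunes illegal moves.

import heapq
from itertools import count

def best_first_proof_search(goal: str, budget: int = 1000):
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(0.0, next(tie), goal, [])]  # (heuristic, tie, state, tactics)
    for _ in range(budget):
        if not frontier:
            return None
        _, _, state, tactics = heapq.heappop(frontier)
        if state == "no goals":  # proof complete
            return tactics       # the tactic script found
        for tactic in llm_propose(state):              # model suggests moves
            ok, new_state = lean_apply(state, tactic)  # checker validates
            if ok:
                est = llm_estimate_distance(new_state)  # heuristic "distance to done"
                heapq.heappush(
                    frontier, (est, next(tie), new_state, tactics + [tactic])
                )
    return None  # budget exhausted
```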
OTOH, maybe not. Maybe the search space is so big that trying to go through it linearly is a waste of CPU. In which case, plausibly the translation to Lean offers no benefit. And now that I think about it, I could imagine that. When doing problems like these, you kind of have to figure out the overall approach end to end first, fill in any gaps in your logic, and the formalization/writing step is kind of the last thing you do. So I could see where starting on formalization from the start could end up being the wrong approach for IMO-level problems. It'd just be nice to have that confirmed.
The cool thing is that if true, it implies this is something completely different from the chess/go engines that rely on sheer computational power. Not so much of a "deep blue" moment, but more of an existential one.
kevinventullo
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.
kurtis_reed
Accurate formalization is presumably easier than solving the problems, so you can always formalize and check after the solution is generated
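Under that assumption, the post-hoc pipeline would be straightforward; a minimal sketch with the same hypothetical `llm`, `autoformalize`, and `lean_check` stand-ins:

```python
# Hypothetical sketch: generate freely, then certify afterwards.
def solve_and_certify(problem: str, attempts: int = 3):
    for _ in range(attempts):
        nl_proof = llm.sample(problem)                # unconstrained generation
        lean_code = autoformalize(problem, nl_proof)  # NL -> Lean afterwards
        if lean_check(lean_code):                     # machine-checked?
            return nl_proof, lean_code
    return None  # no candidate survived verification
```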
nicf
I'm a mathematician, although not doing research anymore. I can maybe offer a little bit of perspective on why we tend to be a little cooler on the formal techniques, which I think I've said on HN before.
I'm actually prepared to agree wholeheartedly with what you say here: I don't think there'd be any realistic way to produce thousand-page proofs without formalization, and certainly I wouldn't trust such a proof without some way to verify it formally. But I also don't think we really want them all that much!
The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer. For example, I was never a number theorist, but I think most people who are informed enough to have an opinion think that the Riemann Hypothesis is probably true, and I know that they're not actually waiting around to find out. There are lots of papers that get published whose results take the form "If the Riemann Hypothesis is true then [my new theorem]."
The reason they'd still be excited by a proof is the hope, informed by experience with proofs of earlier long-standing open problems, that the proof would involve some exciting new method or perspective that would give us a deeper understanding of number theory. A proof in a formal language that Lean says is true but which no human being has any hope of getting anything from doesn't accomplish that.
mietek
A proof written in a formal language can absolutely be illuminating to a human, but you have to pick the correct formal language and ecosystem.
Writing proofs in Agda is like writing programs in a more expressive variant of Haskell. Abelson said that “programs must be written for people to read, and only incidentally for machines to execute”, and by the Curry-Howard isomorphism, proofs can be seen as programs. All the lessons of software engineering can and indeed should be applied to proofs.
For a quick example, check out my mechanization of Martin-Löf’s 2006 paper on the axiom of choice:
https://research.mietek.io/mi.MartinLof2006.html
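As a tiny Lean 4 illustration of the proofs-as-programs point (my own toy example, not from the Agda development linked above):

```lean
-- Under Curry-Howard, this proof term is literally a small program:
-- it takes evidence for p ∧ q and rearranges it into evidence for q ∧ p.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
  fun ⟨hp, hq⟩ => ⟨hq, hp⟩
```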
lblume
I have always wondered what could be recovered if the antecedent (in this case the Riemann hypothesis) actually turns out to be false. Are the theorems completely useless? Can we still infer some knowledge or use some techniques? The same applies to SETH and fine-grained complexity theory.
modeless
These problems are designed to be solvable by humans without tools. No reason we can't give tools to the models when they go after harder problems. I think it's good to at least reproduce human-level skill without tools first.
kevinventullo
Oh so to be clear, I view formal methods as less of a useful tool, and more as enforcing a higher standard of proof. E.g. it’s not clear to me that having access to Lean would actually help a human in the IMO; certainly most professional mathematicians are not yet getting a positive ROI from formalization. But I’m sort of an armchair expert here; I could be wrong!
kevinventullo
(Stream of consciousness aside:
That said, letting machines go wild in the depths of the consequences of some axiomatic system like ZFC may reveal a method of proof mathematicians would find to be monstrous. So like, if ZFC is inconsistent, then anything can be proven. But short of that, maybe the machines will find extremely powerful techniques which “almost” prove inconsistency that nevertheless somehow lead to logical proofs of the desired claim. I’m thinking by analogy here about how speedrunning seems to often devolve into exploiting an ACE glitch as early as possible, thus meeting the logical requirements of finishing the game while violating the spirit. Maybe we’d have to figure out what “glitchless ZFC” should mean. Maybe this is what logicians have already been doing heh).
shiandow
Comparing the answers from OpenAI and Gemini, Gemini's writing style is a lot clearer. It could be presented a bit better, but it's easy enough to follow the proof. This also makes it a lot shorter than OpenAI's answer, and it uses proper prose.
cubefox
I found the proofs you were referring to:
Google https://storage.googleapis.com/deepmind-media/gemini/IMO_202...
sweezyjeezy
Gemini is clearer, but MY GOD is it verbose. E.g. look at problem 1, section "2. Analysis of the Core Problem" - there's nothing at all deep here, but it seems the model wants to spell out every single tiny logical step. I wonder if this is a stylistic choice or something that actually helps the model get to the end.
vessenes
They actually do help, in that they give the model more computation time and also allow real-time management of the input context by the model. You can see the same behavior in the excessive comment-writing some coding models engage in; Anthropic has said in interviews that these do actually help the model.
shiandow
Section 2 is a case-by-case analysis. Those are never pretty, but they're perfectly normal given the problem.
With OpenAI that part takes up about 2/3 of the proof, even with its fragmented prose. I don't think it does much better.
fpia
I'm interested in feedback from legitimate third-party users not associated with Google: have you ever tried to get anything done well with Gemini? I have, and the thing is in chains. Generate images? No can do, copyright. Evaluate available hardware for a DIY wireless camera? No can do, can't endorse surveillance. Code? WRONG. General advice? Hallucinations.
I swear, I currently use Perplexity, Claude, and ChatGPT; I even tried DeepSeek (which has its own share of obstacles). But Gemini? Never again.
lblume
I find Gemini Pro to be much more capable than GPT-4o at reading comprehension, code writing and brainstorming.
tornikeo
I think we are having a Deep Blue vs. Kasparov moment in competitive math right now. This is huge progress from just a few years ago, and yet I think we are still really far away from even a semi-respectable AI mathematician. What an exciting time to be alive!
NitpickLawyer
Terence Tao, in a recent podcast, said that he's very interested in "working alongside these tools". He sees the best near-term use as "explorers of human-set vision", in a way (i.e. set some ideas/parameters and let the LLMs explore and do parallel search / proofs / etc.).
Your comparison with chess engines is pretty spot-on; that's how the best of the best chess players do prep nowadays. Gone are the multi-person expert teams that analysed positions and offered advice. They now have analysts who use supercomputers to search through bajillions of positions, extract the best ideas, and distill them for their players.
j2kun
He also had some interesting things to say about these IMO results: https://mathstodon.xyz/@tao/114881418225852441
7373737373
He created a YouTube channel showcasing working alongside these tools: https://youtube.com/@TerenceTao27
tough
> They now have analysts that use supercomputers to search through bajillions of positions and extract the best ideas, and distill
I was recently researching AIs for this; it seems it would be a huge unlock for some parts of science where this is the case too, as in chess.
ars
Similar to https://en.wikipedia.org/wiki/Advanced_chess
The Wikipedia article doesn't have much info on the results, but from other reading I got the impression that the combination produced results stronger than any individual human or computer player.
gjm11
My understanding is that they did, but don't any more; it's no longer true that humans understand enough things about chess better than computers for the human/computer collaboration to contribute anything over just using the computer.
I don't think the interval between "computers are almost as strong as humans" and "computers are so much stronger than humans that there's no way for even the strongest humans to contribute anything that improves the computer's play" was very long. We'll see whether mathematics is any different...
aldousd666
The difference here is that no amount of brute force can produce the winning score in the timeframe. This is more of a legitimate technical milestone and less of a moral 'in principle' one like we saw with Deep Blue.
drexlspivey
More like Deep Blue vs. a child prodigy. In the IMO the contestants are high school students, not the greatest mathematicians in the world.
bbconn
Of course contest math is not research math, but the active IMO kids are pretty much the best in the world at math contests.
Sol-
> all within the 4.5-hour competition time limit
Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.
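Such a pipeline is at least easy to sketch (purely illustrative; `llm_solve` and `grader_score` are made-up stand-ins, and neither lab has confirmed this is what was done):

```python
# Hypothetical sketch: parallel best-of-n proof search with a ranking model.
from concurrent.futures import ThreadPoolExecutor

def best_of_n(problem: str, n: int = 64) -> str:
    with ThreadPoolExecutor(max_workers=16) as pool:
        candidates = list(pool.map(lambda _: llm_solve(problem), range(n)))
    # an evaluator model scores each candidate; submit the top-ranked proof
    return max(candidates, key=grader_score)
```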
Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.
lblume
> They could have spun up a million parallel reasoning processes
But alas, they did not, and in fact nobody has (yet). Enumerating proofs is notoriously hard for deterministic systems. I strongly recommend reading Aaronson's paper on the intersection of philosophy and complexity theory, which touches on these points in more detail: [1]
be7a
Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I would suspect this likely leads to improved performance even outside of math competitions. It’ll be fascinating to see how much further this frontier can go.
The article also suggests that the system used isn't too far ahead of their upcoming general "Deep Think" model / feature, which they announced for this summer.
sanjitb
Why do they brag about not using a theorem prover? To me, whatever tool helps the model perform, go for it.
Besides, they still specialized Gemini for the IMO in other ways:
> we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.
alephnerd
> Why do they brag about not using a theorem prover
Because this highlights that Gemini actually reasoned independently of other tools. That is a massive leap in AI/ML. Abstract reasoning is arguably the basis of cognition.
gyrovagueGeist
Useful and interesting but likely still dangerous in production without connecting to formal verification tools.
I know o3 is far from state of the art these days, and it's great at finding relevant literature and suggesting inequalities to consider, but in actual proofs it can produce convincing-looking statements that are false if you follow the details, or even just the algebra, carefully. Subtle errors like these might become harder to detect as the models get better.
sweezyjeezy
100% o3 has a strong bias towards "write something that looks like a formal argument that appears to answer the question" over writing something sound.
I gave it a bunch of recent, answered MathOverflow questions - graduate-level maths queries. Sometimes it would get demonstrably the wrong answer, but it would not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!
irthomasthomas
Woah, they used parallel reasoning - an idea I open-sourced about a month before GDM's first paper on it. Very cool. https://x.com/GoogleDeepMind/status/1947333836594946337 So you might be able to achieve similar performance at home today using llm-consortium: https://github.com/irthomasthomas/llm-consortium
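The consortium idea, roughly (illustrative code only, not llm-consortium's actual API):

```python
# Hypothetical sketch: several models draft in parallel, an arbiter synthesizes.
from concurrent.futures import ThreadPoolExecutor

def consortium(prompt: str, models: list, arbiter) -> str:
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda m: m.generate(prompt), models))
    combined = "\n\n---\n\n".join(drafts)
    return arbiter.generate(
        f"Synthesize the single best answer from these drafts:\n\n{combined}"
    )
```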
vouaobrasil
This is making mathematics too systematic and mechanical, and it kills the joy of it...
ted_dunning
It didn't kill chess
asdfologist
It did, partially, which is why top players nowadays play Freestyle (Chess960) more and more.
modeless
> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.
I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.
Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305