Gemini 3 Pro Model Card

84 comments · November 18, 2025

scrlk

Benchmarks from pg 4 of the system card:

    | Benchmark                         | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |---------------------------------- |-----------|---------|------------|-----------|
    | Humanity’s Last Exam              | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2                         | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond                      | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025 (no tools)              | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |           (with code execution)   | 100%      | —       | 100%       | —         |
    | MathArena Apex                    | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro                          | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro                    | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning                 | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5                  | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU                        | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro                 | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0                | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified                | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench                          | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2                   | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite             | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified                 | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                              | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA                       | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle) (128k avg)     | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |                    (1M pointwise) | 26.3%     | 16.4%   | n/s        | n/s       |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

Alifatisk

These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader (arguably, they already earned that title with 2.5 Pro).

What makes me even more curious is the following

> Model dependencies: This model is not a modification or a fine-tune of a prior model

So did they start from scratch with this one?

postalcoder

Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding was such a hilarious misstep.

My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.

Alifatisk

Oh, I remember the times when I compared Gemini with ChatGPT and Claude. Gemini was so far behind, it was barely usable. And now they are pushing the boundaries.

basch

At least at the moment, coming in late seems to matter little.

Anyone with money can trivially catch up to a state of the art model from six months ago.

benob

What does it mean nowadays to start from scratch? At least in the open scene, most of the post-training data is generated by other LLMs.

Alifatisk

They had to start with a base model; that part I am certain of.

falcor84

That looks impressive, but some of the numbers are a bit out of date.

On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.

HugoDias

Very impressive. I wonder if this sends a different signal to the market regarding using TPUs for training SOTA models versus Nvidia GPUs. From what we've seen, OpenAI is already renting them to diversify... Curious to see what happens next.

manmal

Looks like it will be on par with the contenders when it comes to coding. I guess improvements will be incremental from here on out.

CjHuber

If it’s on par in code quality, it would be a way better model for coding because of its huge context window.

mynti

It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.

Palmik

Also does not beat GPT-5.1 Codex on terminal bench (57.8% vs 54.2%): https://www.tbench.ai/

I did not bother verifying the other claims.

HereBePandas

My understanding (not an expert) is that SWE Bench is a flawed benchmark (flawed unit tests, underspecified tasks, etc.), which is why variants like SWE Bench Verified exist [1].

I suspect, given these known issues plus the fact that SWE Bench progress has stalled while models improve on Terminal Bench and other good benchmarks, that SWE Bench is now saturated and should probably be retired.

[1] https://epoch.ai/benchmarks/swe-bench-verified

Palmik

The reported results where GPT 5.1 beats Gemini 3 are on SWE Bench Verified, and GPT 5.1 Codex also beats Gemini 3 on Terminal Bench.

tosh

This might also hint at SWE Bench struggling to capture what “being good at coding” means.

Evals are hard.

embedding-shape

Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com`, which obviously fails...

Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?
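
(A rough way to reproduce that observation: the sketch below, using Python's stdlib `ssl` plus the third-party `cryptography` package, just prints whatever certificate the network path actually serves; the hostname `pixeldrain.com` is my assumption for the site in question.)

    import ssl
    from cryptography import x509  # third-party: pip install cryptography

    # Fetch whatever certificate is actually served for the host. On an
    # unfiltered connection the subject belongs to the site itself; behind an
    # interception appliance the subject/issuer may point elsewhere
    # (e.g. allot.com, as described above).
    pem = ssl.get_server_certificate(("pixeldrain.com", 443))  # hostname assumed
    cert = x509.load_pem_x509_certificate(pem.encode())
    print("subject:", cert.subject.rfc4514_string())
    print("issuer: ", cert.issuer.rfc4514_string())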

Fornax96

Creator of pixeldrain here. I have no idea why my site is blocked in Spain, but it's a long-running issue.

I actually never discovered who was responsible for the blockade until I read this comment. I'm going to look into Allot and send them an email.

EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

zozbot234

Could it be that some site in your network neighborhood was illegally streaming soccer matches?

amarcheschi

That website is used to share everything, including pirated things, so maybe that's the reason.

denysvitali

Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.

Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well

jmkni

What is Google Antigravity?

Yossarrian22

The ASI figured out zero point energy from first principles

denysvitali

I guess we'll know in a few hours. Most likely another AI playground, or maybe a Google Search alternative? No clue, really.

transcriptase

There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.

swalsh

You're absolutely right

jstummbillig

Does not get old.

Lord-Jobo

And have the score heavily modified based on how fixable the sycophancy is.

postalcoder

I care very little about model personality outside of sycophancy. The thing about Gemini is that it's notorious for its low self-esteem. Given that this one is trained from scratch, I'm very curious to see which direction they've decided to take it.

supjeff

Given how often these LLMs are wrong, doesn't it make sense that they are less confident?

BoredPositron

Your comment demonstrates a remarkably elevated level of cognitive processing and intellectual rigor. Inquiries of this caliber are indicative of a mind operating at a strategically advanced tier, displaying exceptional analytical bandwidth and thought-leadership potential. Given the substantive value embedded in your question, it is operationally imperative that we initiate an immediate deep-dive and execute a comprehensive response aligned with the strategic priorities of this discussion.

lxdlam

What does "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product, but it currently routes to the Google main page.

dbosch

I was asking myself the exact same question. No idea

mohsen1

     This model is not a modification or a fine-tune of a prior model

Is it common to mention that? It feels like they built something from scratch.

scosman

I think they are just indicating it’s a new architecture vs continued training of 2.5 series.

irthomasthomas

Never seen it before. I suppose it adds to the excitement.

laborcontract

It's hilarious that the release of Gemini 3 is getting eclipsed by this Cloudflare outage.

yen223

Coincidence? Yes

senordevnyc

It hasn't been released; this is just a leak.

margorczynski

If these numbers are true, then OpenAI is probably done, and Anthropic too. Still, it's hard to see an effective monetization method for this tech, and it is clearly eating into Google's main pie, which is search.

alecco

For SWE it is the same ranking. But if Google's $20/mo plan is comparable to the $100-200 plans from OpenAI and Anthropic, then yes, they are done.

But we'll have to wait a few weeks to see if the model is still this good once it's (inevitably) nerfed post-release.

Sol-

Why? These models just leapfrog each other as time advances.

One month Gemini is on top, then ChatGPT, then Anthropic. Not sure why everyone gets FOMO whenever a new version gets released.

remus

I think Google is uniquely well placed to make a profitable business out of AI: they make their own TPUs, so they don't have to pay ridiculous amounts of money to Nvidia; they have a great depth of talent in building models; they've got loads of data they can use for training; and they've got a huge existing customer base who can buy their AI offerings.

I don't think any other company has all these ingredients.

spaceman_2020

The bear case for Google was always that the business side would cannibalize the AI side: AI makes search redundant, which kills the golden goose.

gizmodo59

While I don’t disagree that Google is the company you can’t bet against when it comes to AI, saying other companies are done is a stretch. If they had a significant moat, they would be at the top all the time, which is not the case.

mlnj

100% the reason I am long on Google. They can take their time to monetize these new costs.

Even other search competitors have not proven to be a danger to Google. There is nothing stopping that search money from coming in.

redox99

Considering GPT-5 was only recently released, it's very unlikely OpenAI will achieve these scores in just a couple of months. If they had something this good in the oven, they'd probably have saved the GPT-5 name for it.

Or maybe Google just benchmaxxed and this doesn't translate at all into real-world performance.

blueblisters

They do have unreleased Olympiad Gold-winning models that are definitely better than GPT-5.

TBD if that performance generalizes to other real-world tasks.

Palmik

GPT 5 was released more than 3 months ago. Gemini 2.5 was released less than 8 months ago.

lukev

Or else it was trained on (overfit to) the benchmarks. We won't really know until people have a chance to use it for real-world tasks.

Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?

ilaksh

The only one it doesn't win is SWE-Bench, where it is significantly behind Claude Sonnet. You just can't take down Sonnet.

svantana

One percentage point is not significant, in either the colloquial or the scientific sense [1].

[1] The binomial formula gives a 95% confidence interval of roughly ±3.7 percentage points, using p=0.77, N=500.
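
A minimal sketch of the arithmetic behind [1], i.e. the normal-approximation binomial interval, assuming p = 0.77 observed accuracy over N = 500 SWE-Bench Verified tasks:

    import math

    # Normal-approximation 95% confidence interval for a benchmark pass rate,
    # matching the footnote: p = 0.77 observed accuracy, N = 500 tasks.
    p, n = 0.77, 500
    z = 1.96  # two-sided 95%
    half_width = z * math.sqrt(p * (1 - p) / n)
    print(f"+/- {half_width * 100:.1f} percentage points")  # ~3.7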

stavros

Codex has been much better than Sonnet for me.

dotancohen

On what types of tasks?

happa

This may just be bad recollection on my part, but hasn't Google reported that their search business is right now the most profitable it has ever been?

senordevnyc

1) New SOTA models come out all the time and that hasn't killed the other major AI companies. This will be no different.

2) Google's search revenue last quarter was $56 billion, a 14% increase over Q3 2024.

margorczynski

1) Not long ago, Altman and the OpenAI CFO were openly asking for public money. None of these AI companies actually has any kind of working business plan; they are just burning investor money. If investors see that there is no winning against Google (or some open Chinese model), the money will dry up.

2) I'm not suggesting this will happen overnight, but younger people especially gravitate towards LLMs for information search and actively use some sort of ad blocking. In the long run it doesn't look great for Google.

paswut

I'd love to see Anthropic/OpenAI pop. Back to some regular programming. The models are good enough; time to invest elsewhere.

oalessandr

Trying to open this link from Italy leads to a CSAM warning

Fornax96

Creator of pixeldrain here. Italy has been doing this for a very long time. They never notified me of any such material being present on my site. I have a lot of measures in place to prevent the spread of CSAM. I have sent dozens of mails to Polizia Postale and even tried calling them a few times, but they never respond. My mails go unanswered and they just hang up the phone.

driverdan

Don't use your ISP's DNS. Switch to something outside of their control.
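
(One way to see whether the ISP resolver is lying is to compare it with a public one. A sketch using the third-party dnspython package; the resolver IP 1.1.1.1 and the domain are illustrative assumptions.)

    import dns.resolver  # third-party: pip install dnspython

    domain = "pixeldrain.com"  # illustrative domain from the thread

    # System default resolver (whatever the ISP/OS hands out) vs. a public one.
    # A filtering resolver typically returns NXDOMAIN, a block-page IP, or times out.
    system = dns.resolver.Resolver()                 # reads /etc/resolv.conf
    public = dns.resolver.Resolver(configure=False)
    public.nameservers = ["1.1.1.1"]                 # illustrative public resolver

    for label, res in [("system resolver", system), ("1.1.1.1", public)]:
        try:
            answers = res.resolve(domain, "A")
            print(label, [a.to_text() for a in answers])
        except Exception as exc:
            print(label, "failed:", type(exc).__name__, exc)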

Jowsey

Pixeldrain is a free anonymous file host, which unfortunately goes hand-in-hand with this kind of thing.