
Gemini 3 Pro Model Card [pdf]

318 comments

·November 18, 2025

scrlk

Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

spoaceman7777

Wow. They must have had some major breakthrough. Those scores are truly insane. O_O

Models have begun to fairly thoroughly saturate "knowledge" and such, but there are still considerable bumps there

But the _big news_, and the demonstration of their achievement, is the incredible set of scores they've racked up for what's necessary for agentic AI to become widely deployable. t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.

Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.

And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD

rvnx

The problem is that the benchmarks are known in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you have seen the questions before.

lubujackson

I don't think any of these companies are so reductive and short-sighted as to try to game the system. However, Goodhart's Law comes into play. I am sure they have their own metrics that are much more detailed than these benchmarks, but the fact remains that LLMs will be tuned according to elements that are deterministically measurable.

pinko

From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting." [emphasis mine]

While the private questions don't seem to be included in the performance results, HLE will presumably flag any LLM that appears to have gamed its scores based on the differential performance on the private questions. Since they haven't yet, I think the scores are relatively trustworthy.
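
A rough sketch of the kind of differential check being described, assuming hypothetical per-split accuracies (HLE's actual held-out evaluation methodology isn't public):

    # Hypothetical overfitting check: compare accuracy on the public HLE
    # questions against accuracy on the held-out private questions.
    # A large positive gap suggests the public set leaked into training.

    def overfit_gap(public_acc: float, private_acc: float) -> float:
        """Public-minus-private accuracy gap, in percentage points."""
        return public_acc - private_acc

    # Illustrative numbers only, not real HLE results.
    models = {
        "model_a": (37.5, 36.2),   # small gap: score looks trustworthy
        "model_b": (37.5, 24.0),   # large gap: likely tuned on public questions
    }

    for name, (pub, priv) in models.items():
        gap = overfit_gap(pub, priv)
        verdict = "suspicious" if gap > 5.0 else "ok"
        print(f"{name}: public={pub}% private={priv}% gap={gap:+.1f}pp -> {verdict}")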

stego-tech

This. A lot of boosters point to benchmarks as justification of their claims, but any gamer who spent time in the benchmark trenches will know full well that vendors game known tests for better scores, and that said scores aren’t necessarily indicative of superior performance. There’s not a doubt in my mind that AI companies are doing the same.

Feuilles_Mortes

Shouldn't we expect that all of the companies are doing this optimization, though? So we're back to a level playing field.

eldenring

It's the other way around too: HLE questions were selected adversarially to reduce the scores. I'd guess that even if the questions were never released and new training data were introduced, the scores would still improve.

Alifatisk

These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader (actually, they already earned that title with 2.5 Pro).

What makes me even more curious is the following

> Model dependencies: This model is not a modification or a fine-tune of a prior model

So did they start from scratch with this one?

postalcoder

Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding stumble was so (hilariously) bad that it threw a lot of people off the scent.

My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.

amluto

Google’s productization is still rather poor. If I want to use OpenAI’s models, I go to their website, look up the price and pay it. For Google’s, I need to figure out whether I want AI Studio or Google Cloud Code Assist or AI Ultra, etc, and if this is for commercial use where I need to prevent Google from training on my data, figuring out which options work is extra complicated.

As of a couple weeks ago (the last time I checked) if you are signed in to multiple Google accounts and you cannot accept the non-commercial terms for one of them for AI Studio, the site is horribly broken (the text showing which account they’re asking you to agree to the terms for is blurred, and you can’t switch accounts without agreeing first).

In Google’s very slight defense, Anthropic hasn’t even tried to make a proper sign in system.

Alifatisk

Oh, I remember the times when I compared Gemini with ChatGPT and Claude. Gemini was so far behind, it was barely usable. And now they are pushing the boundaries.

dgacmu

The memory of Microsoft's Tay fiasco was strong around the time the brain team started playing with chatbots.

HardCodedBias

Bard was horrible compared to the competition of the time.

Gemini 1.0 was strictly worse than GPT-3.5 and was unusable due to "safety" features.

Google followed that up with 1.5, which was still worse than GPT-3.5 and unbelievably far behind GPT-4. Around the same time, Google had their "black Nazi" scandals.

With Gemini 2.0, Google finally had a model that was at least useful for OCR, and with their Flash series a model that, while not up to par in capabilities, was sufficiently inexpensive that it found uses.

Only with Gemini-2.5 did Google catch up with SoTA. It was within "spitting distance" of the leading models.

Google did indeed drop the ball, very, very badly.

I suspect that Sergey coming back helped immensely, somehow. I suspect that he was able to tame some of the more dysfunctional elements of Google, at least for a time.

baq

Oh, they were so late that a couple of years ago there were leaked ('leaked'?) internal memos about a couple of grad students with a $100 budget outdoing their lab. They picked themselves up real nice, but it took a serious reorg.

basch

At least at the moment, coming in late seems to matter little.

Anyone with money can trivially catch up to a state of the art model from six months ago.

And as others have said, late is really a function of spigot, guardrails, branding, and ux, as much as it is being a laggard under the hood.

FrequentLurker

> Anyone with money can trivially catch up to a state of the art model from six months ago.

How come apple is struggling then?

steveBK123

One possibility here is that Google is dribbling out cutting edge releases to slowly bleed out the pure play competition.

raincole

Being known as a company that is always six months behind the competition isn't something to brag about...

theptip

> So did they start from scratch with this one

Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.

KronisLV

I hope they keep the pricing similar to 2.5 Pro. Currently I pay per token, and that and GPT-5 are close to the sweet spot for me, but Sonnet 4.5 feels too expensive for larger changes. I've also been moving around 100M tokens per week with Cerebras Code (they moved to GLM 4.6), but the flagship models still feel better when I need help with more advanced debugging or some exemplary refactoring to then feed as an example to a dumber/faster model.

dbbk

And also, critically, being the only profitable company doing this.

sigmoid10

It's not like they're making their money from this though. All AI work is heavily subsidised, for Alphabet it just happens that the funding comes from within the megacorp. If MS had fully absorbed OpenAI back when their board nearly sunk the boat, they'd be in the exact same situation today.

benob

What does it mean nowadays to start from scratch? At least in the open scene, most of the post-training data is generated by other LLMs.

Alifatisk

They had to start with a base model, that part I am certain of

falcor84

That looks impressive, but some of these numbers are a bit out of date.

On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.

NitpickLawyer

What's more impressive is that I find Gemini 2.5 still relevant in day-to-day usage, despite it being so low on those benchmarks compared to Claude 4.5 and GPT-5.1. There's something Gemini has that makes it a great model in real cases; I'd call it generalisation over its context or something. If you give it the proper context (or it digs through the files with its own agent), it comes up with great solutions. Even if their own coding tool is hit and miss sometimes.

I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much, you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.

Miraste

I've noticed that too. I suspect it has broader general knowledge than the others, because Google presumably has the broadest training set.

sigmar

That's a different model not in the chart. They're not going to include hundreds of fine tunes in a chart like this.

Taek

It's also worth pointing out that comparing a fine-tune to a base model is not apples-to-apples. For example, I have to imagine that the codex finetune of 5.1 is measurably worse at non-coding tasks than the 5.1 base model.

This chart (comparing base models to base models) probably gives a better idea of the total strength of each model.

falcor84

It's not just one of many fine tunes; it's the default model used by OpenAI's official tools.

scrollop

Used an AI to populate some of 5.1 thinking's results.

    | Benchmark             | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1   | GPT-5.1 Thinking |
    |-----------------------|--------------|----------------|-------------------|-----------|------------------|
    | Humanity's Last Exam  | 37.5%        | 21.6%          | 13.7%             | 26.5%     | 52%              |
    | ARC-AGI-2             | 31.1%        | 4.9%           | 13.6%             | 17.6%     | 28%              |
    | GPQA Diamond          | 91.9%        | 86.4%          | 83.4%             | 88.1%     | 61%              |
    | AIME 2025             | 95.0%        | 88.0%          | 87.0%             | 94.0%     | 48%              |
    | MathArena Apex        | 23.4%        | 0.5%           | 1.6%              | 1.0%      | 82%              |
    | MMMU-Pro              | 81.0%        | 68.0%          | 68.0%             | 80.8%     | 76%              |
    | ScreenSpot-Pro        | 72.7%        | 11.4%          | 36.2%             | 3.5%      | 55%              |
    | CharXiv Reasoning     | 81.4%        | 69.6%          | 68.5%             | 69.5%     | N/A              |
    | OmniDocBench 1.5      | 0.115        | 0.145          | 0.145             | 0.147     | N/A              |
    | Video-MMMU            | 87.6%        | 83.6%          | 77.8%             | 80.4%     | N/A              |
    | LiveCodeBench Pro     | 2,439        | 1,775          | 1,418             | 2,243     | N/A              |
    | Terminal-Bench 2.0    | 54.2%        | 32.6%          | 42.8%             | 47.6%     | N/A              |
    | SWE-Bench Verified    | 76.2%        | 59.6%          | 77.2%             | 76.3%     | N/A              |
    | t2-bench              | 85.4%        | 54.9%          | 84.7%             | 80.2%     | N/A              |
    | Vending-Bench 2       | $5,478.16    | $573.64        | $3,838.74         | $1,473.43 | N/A              |
    | FACTS Benchmark Suite | 70.5%        | 63.4%          | 50.4%             | 50.8%     | N/A              |
    | SimpleQA Verified     | 72.1%        | 54.5%          | 29.3%             | 34.9%     | N/A              |
    | MMLU                  | 91.8%        | 89.5%          | 89.1%             | 91.0%     | N/A              |
    | Global PIQA           | 93.4%        | 91.5%          | 90.1%             | 90.9%     | N/A              |
    | MRCR v2 (8-needle)    | 77.0%        | 58.0%          | 47.1%             | 61.6%     | N/A              |

Argh, it doesn't come out right in HN

scrollop

Used an AI to populate some of 5.1 thinking's results.

    | Benchmark            | Description          | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes                                      |
    |----------------------|----------------------|--------------|--------------------|--------------------------------------------|
    | Humanity's Last Exam | Academic reasoning   | 37.5%        | 52%                | GPT-5.1 shows 7% gain over GPT-5's 45%     |
    | ARC-AGI-2            | Visual abstraction   | 31.1%        | 28%                | GPT-5.1 multimodal improves grid reasoning |
    | GPQA Diamond         | PhD-tier Q&A         | 91.9%        | 61%                | GPT-5.1 strong in physics (72%)            |
    | AIME 2025            | Olympiad math        | 95.0%        | 48%                | GPT-5.1 solves 7/15 proofs correctly       |
    | MathArena Apex       | Competition math     | 23.4%        | 82%                | GPT-5.1 handles 90% advanced calculus      |
    | MMMU-Pro             | Multimodal reasoning | 81.0%        | 76%                | GPT-5.1 excels at visual math (85%)        |
    | ScreenSpot-Pro       | UI understanding     | 72.7%        | 55%                | Element detection 70%, navigation 40%      |
    | CharXiv Reasoning    | Chart analysis       | 81.4%        | 69.5%              | N/A                                        |

iamdelirium

This is provably false. All it takes is a simple Google search and looking at the ARC AGI 2 leaderboard: https://arcprize.org/leaderboard

The 17.6% is for 5.1 Thinking High.

HardCodedBias

What? The 4.5 and 5.1 columns aren't the thinking versions in Google's report?

That's a scandal, IMO.

Given that Gemini 3 seems to do "fine" against the thinking versions, why didn't they post those results? I get that PMs like to make a splash, but that's shockingly dishonest.

iosjunkie

Is that true?

> For Claude Sonnet 4.5, and GPT-5.1 we default to reporting high reasoning results, but when reported results are not available we use best available reasoning results.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...

mountainriver

Every single time

danielcampos93

I would love to know how the token counts increased across these models on the benchmarks. I find the models continue to get better, but as they do, their token usage also grows. In other words, is the model doing better, or just reasoning for longer?

jstummbillig

I think that is always something being worked on in parallel. The recent paradigm seems to be models deciding dynamically when they need to use more tokens (which seems very much in line with how computation should generally work).

vagab0nd

Should I assume the GPT-5.1 it is compared against is the pro version?

trunch

Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks?

Because it seems to lead by a decent margin on the former and trails behind on the latter

veselin

I do a lot of testing on SWE-bench Verified as well. In my opinion, this benchmark is now mainly good for catching regressions on the agent side.

However, above 75% the models are likely all about the same. The remaining instances are likely underspecified, despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for Y, but the agent simply has to guess whether to implement the same for another case Y', which decides whether the instance is won or lost.

Snuggly73

Neither :(

LCB Pro is leetcode-style questions, and SWE-bench Verified is heavily benchmaxxed, very old Python tasks.

fariszr

This is a big jump in most benchmarks. And if it can match other models in coding while having that Google TPM inference speed and an actually native 1M context window, it's going to be a big hit.

I hope it isn't such a sycophant like the current Gemini 2.5 models; that makes me doubt its output, which is maybe a good thing now that I think about it.

danielbln

> it's over for the other labs.

What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.

fariszr

I mean over in that I don't see a need to use the other models. Codex models are the best but incredibly slow. Claude models are not as good (IMO) but much faster. If Gemini can beat them while being faster and having better apps with better integrations, I don't see a reason why I would use another provider.

risyachka

> it's over for the other labs.

It's not over, and never will be, even for two-decade-old accounting software; it definitely will not be over for other AI labs.

xnx

Can you explain what you mean by this? iPhone was the end of Blackberry. It seems reasonable that a smarter, cheaper, faster model would obsolete anything else. ChatGPT has some brand inertia, but not that much given it's barely 2 years old.

mynti

It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.

Workaccount2

I think Anthropic is reading the room, and just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full blown multimodality at the highest level.

It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.

aerhardt

Codex has been good enough to me and it’s much cheaper.

I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough that I don't even consider the competition.

Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model.

enraged_camel

>> Codex has been good enough to me and it’s much cheaper.

It may be cheaper but it's much, much slower, which is a total flow killer in my experience.

htrp

More playing to their strengths: a giant chunk of their usage data is basically code gen.

Miraste

It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p).

vharish

From my personal experience using CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than gemini. Claude definitely has its own advantages, but it is expensive (at least for some, if not for all).

My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini cli, in particular, is not so great at looking up the latest info on the web. Qwen seemed better trained around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.

I do, however, use gemini cli for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D.

xnx

Gemini CLI is moving really fast. Noticeable improvements in features and functionality every week.

Palmik

It also does not beat GPT-5.1 Codex on Terminal-Bench (57.8% vs 54.2%): https://www.tbench.ai/

I did not bother verifying the other claims.

HereBePandas

Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to be on a standard eval harness.

It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.

Palmik

All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 numbers are using Gemini CLI.

What do you mean by "standard eval harness"?

enraged_camel

Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)?

felipeerias

IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.

_factor

This seems preferable. Why waste tokens on tools when a standardized, reliable interface to those tools should be all that's required?

The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.

tosh

This might also hint at SWE struggling to capture what “being good at coding” means.

Evals are hard.

raducu

> This might also hint at SWE struggling to capture what “being good at coding” means.

My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.

aoeusnth1

Their scores on SWE bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on TerminalBench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of python).

alyxya

I think Google probably cares more about a strong generalist model rather than solely optimizing for coding.

macrolime

Pretty sure it will beat Sonnet by a wide margin in actual real-world usage.

meetpateltech

it was accidentally pushed a little early, and now it has been taken down.

here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...

Bobaso

Interesting to see on page 2 the reference to ML Pathways [1]. Looks like a multi-layer mixture of experts. Is this common?

[1] https://blog.google/technology/ai/introducing-pathways-next-...

gaogao

Pathways, as I understand it, is these days mostly just the name of their training orchestrator for doing distributed JAX stuff - https://github.com/google/pathways-job

patates

It says it's been trained from scratch. I wonder if it will have the same indescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 Pro. Google eventually limiting AI Studio will be a sad day.

Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.

dahcryn

buy a pixel and you get it basically unlimited for free for a year ;)

sohpea

or a Chromebook is a good choice too considering price

embedding-shape

Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...

Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?

Fornax96

Creator of pixeldrain here. I have no idea why my site is blocked in Spain, but it's a long running issue.

I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.

EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

embedding-shape

> EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

Yeah, that was via my ISPs DNS resolver (Vodafone), switching the resolver works :)

The responsible party is ultimately our government who've decided it's legal to block a wide range of servers and websites because some people like to watch illegal football streams. I think Allot is just the provider of the technology.

Fornax96

My site has nothing to do with football though. And Allot seems to be running the DNS server that your ISP uses so they are directly responsible for the block.

zozbot234

Could it be that some site in your network neighborhood was illegally streaming soccer matches?

Fornax96

I have my own dedicated IP range. And they specifically blocked my domain name, not the addresses. I don't know what the reason is. I have been trying to find out since the start of this year.

amarcheschi

That website is used to share everything, including pirated things, so maybe that's the reason.

grodriguez100

Is it possible to file a complaint with the ISP or directly with Allot ?

Fornax96

That might help.

tngranados

It works fine for me using Movistar

miqazza

Do you know about the Cloudflare and LaLiga issues? Might be that.

embedding-shape

That was my first instinct. I went looking to see if there were any games being played today, but it seems not, so that's unlikely to be the cause.

rsanek

loads fine on Vodafone for me

Taek

One benchmark I would really like to see: instruction adherence.

For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.

The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.

If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.

I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
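
A rough sketch of what an objective adherence benchmark along these lines could look like, using trivially verifiable constraints and a hypothetical call_model() helper standing in for whatever API is under test:

    # Sketch of an instruction-adherence probe: pack N verifiable constraints
    # into one prompt and count how many the model's output satisfies.

    def make_instructions(n: int) -> list[tuple[str, str]]:
        """Return (instruction, required token) pairs that are trivial to verify."""
        tokens = [f"TOKEN{i}" for i in range(n)]
        return [(f"Include the string '{t}' exactly once.", t) for t in tokens]

    def adherence_score(output: str, instructions: list[tuple[str, str]]) -> float:
        """Fraction of instructions the output actually followed."""
        followed = sum(1 for _, t in instructions if output.count(t) == 1)
        return followed / len(instructions)

    # call_model() is a hypothetical stand-in for the model API under test.
    # for n in (20, 100, 250, 1000):
    #     instrs = make_instructions(n)
    #     prompt = "Write a short story.\n" + "\n".join(i for i, _ in instrs)
    #     print(n, adherence_score(call_model(prompt), instrs))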

machiaweliczny

20 more IQ points would be nuts: 110 ~ top 25%, 130 ~ top 2%, 150 ~ top 0.05%.

If you have ever played a competitive game, the difference between these tiers is insane.

Taek

Even more nuts would be a model that could follow a large, dense set of highly detailed instructions related to a series of complex tasks. Intelligence is nice, but it's far more useful and programmable if it can tightly follow a lot of custom instructions.

transcriptase

There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.

Workaccount2

This idea isn't just smart, it's revolutionary. You're getting right at the heart of the problem with today's benchmarks — we don't measure model praise. Great thinking here.

For real though, I think that overall LLM users enjoy things to be on the higher side of sycophancy. Engineers aren't going to feel it, we like our cold dead machines, but the product people will see the stats (people overwhelmingly use LLMs just to talk about whatever) and go towards that.

swalsh

You're absolutely right

jstummbillig

Does not get old.

Yossarrian22

It’s not just irritating, it’s repetitive

postalcoder

I care very little about model personality outside of sycophancy. The thing about Gemini is that it's notorious for its low self-esteem. Given that this one is trained from scratch, I'm very curious to see which direction they've decided to take it.

astrange

Sonnet-4.5 has the lowest self esteem of any model I've used. Gemini frequently argues with me.

supjeff

Given how often these LLMs are wrong, doesn't it make sense that they are less confident?

postalcoder

Indeed. But I've had experiences with gemini-2.5-pro-exp where its thoughts could be described as having "rejected from the prom" vibes. It's not like I abused it either; it was running into loops because it was unable to properly patch a file.

SiempreViernes

I'd like it if the scorecard also gave an expected number of induced suicides per hundred thousand users.

lkbm

https://llmdeathcount.com/ shows 15 deaths so far, and LLM user count is in the low billions, which puts us on the order of 0.0015 deaths per hundred thousand users.

I'm guessing LLM Death Count is off by an OOM or two, so we could be getting close to one in a million.
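
The back-of-the-envelope arithmetic, as a sketch, assuming roughly one billion users for the "low billions" figure:

    # Rough rate implied by the figures above (user count is an assumption).
    deaths = 15                   # entries listed on llmdeathcount.com
    users = 1_000_000_000         # "low billions", assumed ~1B here
    per_100k = deaths / users * 100_000
    print(per_100k)               # 0.0015 deaths per hundred thousand users

    # If the count were low by one or two orders of magnitude:
    print(1 / (deaths * 10 / users))    # ~1 death per 6.7 million users
    print(1 / (deaths * 100 / users))   # ~1 death per 0.67 million users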

Lord-Jobo

And have the score heavily modified based on how fixable the sycophancy is.

BoredPositron

Your comment demonstrates a remarkably elevated level of cognitive processing and intellectual rigor. Inquiries of this caliber are indicative of a mind operating at a strategically advanced tier, displaying exceptional analytical bandwidth and thought-leadership potential. Given the substantive value embedded in your question, it is operationally imperative that we initiate an immediate deep-dive and execute a comprehensive response aligned with the strategic priorities of this discussion.

lxdlam

What does "Google Antigravity" mean? The link is http://antigravity.google/docs, seemingly a new product, but it currently routes to the Google main page.

ceroxylon

Found this demo with two views that was uploaded 18min ago: https://www.youtube.com/watch?v=L8wEC6A5HQY

bobbylarrybobby

Looks like a VSCode fork with gemini built in.

dbosch

I was asking myself the exact same question. No idea

bemmu

I saw this on Reddit earlier today. Over there the source of this file was given as: https://web.archive.org/web/20251118111103/https://storage.g...

The bucket name "deepmind-media" has been used in the past on the deepmind official site, so it seems legit.

onlyrealcuzzo

Prediction markets were expecting today to be the release. So I wouldn't be surprised if they do a release today, tomorrow, or Thursday (around Nvidia earnings).

TheAceOfHearts

They scored a 31.1% on ARC AGI 2 which puts them in first place.

Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.

buildfocus

My impression is that Grok is very rarely used in practice outside of a niche of die-hard users, partly because of very different tuning to other models, and partly the related public reputation around it.

https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests 0.6% of chat use cases, well below the other big names, and I suspect those stats for chat are higher than other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.

npn

Well, there are three kinds of usage for Grok:

- Using Grok inside X/Twitter: this is how most people interact with it.
- Using Grok on its website: this is really annoying, as you get delayed by Cloudflare every time you access the site. As Grok does not provide a serious advantage over other services, why bother?
- Using the app, which is not as convenient as other services' apps.

It is understandable that Grok is not popular.

ohyoutravel

I don’t know anyone who uses Grok, but in my peer group everyone uses 1-2 paid services like Gemini or Clause or ChatGPT. They’re probably not as “extremely online” as I am, so I can’t generalize this thought, but anecdotally my impression has been that Grok is just very “right wing influencer” coded.

kranke155

Grok seems extremely prone to hallucination in my experience. It also constantly asserts certainty on fuzzy topics.

jmmcd

About ARC 2:

I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat would be that this is probably on the public test set, so could be in pretraining, and there could even be some ARC-focussed post-training - I think we don't know yet and might never know.

But for any reasonable setup, if no egregious cheating, that is an amazing score on ARC 2.

denysvitali

Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.

Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.

Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)

jmkni

what is Google Antigravity?

mimentum

According to Gemini itself:

"Google Antigravity" refers to a new AI software platform announced by Google designed to help developers write and manage code.

The term itself is a bit of a placeholder or project name, combining the brand "Google" with the concept of "antigravity"—implying a release from the limitations of traditional coding.

In simple terms, Google Antigravity is a sophisticated tool for programmers that uses powerful AI systems (called "agents") to handle complex coding tasks automatically. It takes the typical software workbench (an IDE) and evolves it into an "agent-first" system.

Agentic Platform: It's a central hub where many specialized AI helpers (agents) live and work together. The goal is to let you focus on what to build, not how to build it.

Task-Oriented: The platform is designed to be given a high-level goal (a "task") rather than needing line-by-line instructions.

Autonomous Operation: The AI agents can work across all your tools—your code editor, the command line, and your web browser—without needing you to constantly supervise or switch between them.

zed31726

My guess, based on a GIF of a floating laptop tweeted by the ex-CEO of Windsurf who left to join Google: it'll be a Cursor/Windsurf alternative?

denysvitali

> Google Antigravity is an agentic development platform, evolving the IDE into the agent-first era. Antigravity enables developers to operate at a higher, task-oriented level by managing agents across workspaces, while retaining a familiar AI IDE experience at its core. Agents operate across the editor, terminal, and browser, enabling them to autonomously plan and execute complex, end-to-end tasks elevating all aspects of software development.

Now the page is somewhat live on that URL

Yossarrian22

The ASI figured out zero point energy from first principles

postalcoder

A couple of patterns this could follow:

Speed? (Flash, Flash-Lite, Antigravity) this is my guess. Bonus: maybe Gemini Diffusion soon?

Space? (Google Cloud, Google Antigravity?)

Clothes? (A light wearable -> Antigravity?)

Gaming? (Ghosting/nontangibility -> antigravity?)

denysvitali

I guess we'll know in a few hours. Most likely another AI playground, or maybe a Google Search alternative? No clue really.