Gemini 3 Pro: the frontier of vision AI

djoldman

Interesting "ScreenSpot Pro" results:

    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1

ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

https://arxiv.org/abs/2504.07981

simonw

I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...

jasonjmcghee

That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?
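To put rough numbers on the downscaling guess above, here is a minimal sketch of how small a GUI target becomes when a high-resolution screenshot is shrunk before the model sees it. Everything in it is an assumption for illustration: the 3840x2160 screenshot, the 24 px icon, and the 1024 px longest-side limit are hypothetical values, not documented behavior of GPT-5.1 or any other model.

```python
def downscale_size(width, height, max_side):
    """Fit the longer side to max_side, keeping aspect ratio (never upscale)."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale), scale


def target_size_after_downscale(target_px, width, height, max_side):
    """Size in pixels of a target that was target_px wide before downscaling."""
    _, _, scale = downscale_size(width, height, max_side)
    return target_px * scale


if __name__ == "__main__":
    # Hypothetical inputs: a 4K screenshot, a 24 px toolbar icon,
    # and a made-up 1024 px limit on the longest side.
    w, h, _ = downscale_size(3840, 2160, 1024)
    icon = target_size_after_downscale(24, 3840, 2160, 1024)
    print(f"screenshot becomes {w}x{h}, icon becomes about {icon:.1f} px wide")
```

If something like this were happening, a small icon would shrink to a handful of pixels, which could plausibly hurt grounding accuracy. Whether any of the benchmarked models actually preprocess images this way is not established in this thread.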

agentifysh

impressive.....most impressive

its going to reach low 90s very soon if trends continue

knollimar

I do some electrical drafting work for construction and throw basic tasks at LLMs.

I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.

fngjdflmdflg

These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.

[0] https://annas-archive.org/blog/critical-window.html

simonw

In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.

TechRemarker

Love how employee portals for many companies essentially never get updated design wise over the decades, lol. That page styling and the balls certainly take me back.

jamiek88

Wow yeah. Flashbacks to when Gmail Invites were cool! Google too.

Same with "See prompt in Google AI Studio" which links to an unpublished prompt in AI Studio.

pseudosavant

I'm really fascinated by the opportunities to analyze videos. The number of tokens a video compresses down to, and what you can reason across those tokens, is incredible.

minimaxir

The actual token calculations for input videos with Gemini 3 Pro are... confusing.

https://ai.google.dev/gemini-api/docs/media-resolution
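As a rough way to reason about video token budgets, here is a minimal sketch of the arithmetic: sampled frames times tokens per frame, plus an optional per-second audio cost. The function name and every default below (1 frame per second, 70 tokens per frame, no audio cost) are hypothetical placeholders, not values taken from the Gemini documentation; the linked page is the place to look up the real per-resolution figures.

```python
def estimate_video_tokens(duration_s, fps=1.0, tokens_per_frame=70,
                          audio_tokens_per_s=0):
    """Estimate input tokens for a video.

    All defaults are placeholders for illustration only; substitute the
    figures from the official media-resolution documentation.
    """
    frames = int(duration_s * fps)               # number of sampled frames
    audio = int(duration_s * audio_tokens_per_s)  # optional audio track cost
    return frames * tokens_per_frame + audio


if __name__ == "__main__":
    # A 10-minute video under the placeholder settings.
    print(estimate_video_tokens(600))
    # The same video with a hypothetical 32 audio tokens per second.
    print(estimate_video_tokens(600, audio_tokens_per_s=32))
```

Keeping the per-frame cost as a parameter makes it easy to compare resolution settings once the actual numbers are known.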

devinprater

Audio described Youtube please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.

hodder

"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."

Prompt: "wine glass full to the brim"

Image generated: 2/3 full wine glass.

True visual and spatial reasoning denied.

spchampion2

I actually did this prompt and found that it worked with a single nudge on a followup prompt. My first shot got me a wine glass that was almost full but not quite. I told it I wanted it full to the top - another drop would overflow. The second shot was perfectly full.

RyJones

The correction I expect to give to an intern, not a junior person.

minimaxir

Gemini 3 Pro is not Nano Banana Pro, and the image generation/model that decodes the generated image tokens may not be as robust.

The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.

hodder

As a consumer I typed this into "Gemini". The behind the scenes model selection just adds confusion.

If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).

What’s new here? I believe this is just gemini 3 which was released last month (the model id hasn’t changed AFAICT)

minimaxir

Nothing new, it's just highlighting practical vision use cases.

agentifysh

im realizing how much of a bottleneck vision models are

im just a glorified speedreadin' promptin' QA at this point with codex

once it replaces the QA layer its truly over for software dev jobs

future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"

edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11% to 73%, even beating Opus 4.5's 50%... chatgpt is at 3.5% and it matches my experience with codex

alex1138

> once it replaces the QA layer its truly over for software dev jobs

Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed. I-promise-not-to-sue-company-if-AI-deletes-the-entire-database

It won't happen for that reason alone. Oh who am I kidding of course it will

siva7

Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something that mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.

minimaxir

That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.

siva7

What's the difference? I thought the vision ai component of gemini 3 is called nano banana?

IanCal

That’s about generating images, the other side is about understanding images.

brokensegue

i assumed nano banana was just a tool that gemini 3 used though i don't know

causal

Okay maybe this one isn't an exaggeration when they say leap forward
