Gemini 3 Pro: the frontier of vision AI
44 comments
· December 5, 2025 · djoldman
simonw
I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
jasonjmcghee
That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?
agentifysh
impressive.....most impressive
its going to reach low 90s very soon if trends continue
knollimar
I do some electrical drafting work for construction and throw basic tasks at LLMs.
I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty PDF. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.
fngjdflmdflg
These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.
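The cost question above comes down to simple arithmetic once per-page token counts and per-token prices are known. A minimal sketch, assuming a per-million-token pricing model; the function name and every number in the example call are hypothetical placeholders, not actual Gemini or Tesseract figures:

```python
def ocr_cost_usd(pages: int,
                 tokens_in_per_page: int,
                 tokens_out_per_page: int,
                 usd_per_m_in: float,
                 usd_per_m_out: float) -> float:
    """Estimate the API cost in USD of transcribing `pages` scanned pages.

    Assumes billing per million input tokens (the page image) and per
    million output tokens (the transcribed text). All inputs are
    assumptions to be replaced with measured values and real prices.
    """
    cost_in = pages * tokens_in_per_page * usd_per_m_in / 1_000_000
    cost_out = pages * tokens_out_per_page * usd_per_m_out / 1_000_000
    return cost_in + cost_out


# Hypothetical placeholder values for one 300-page book.
per_book = ocr_cost_usd(pages=300,
                        tokens_in_per_page=1_000,
                        tokens_out_per_page=500,
                        usd_per_m_in=2.0,
                        usd_per_m_out=10.0)
print(f"estimated cost per book: ${per_book:.2f}")
```

Multiplying the per-book figure by the size of a collection gives a first-order total; the raw cost of self-hosting would need a separate estimate.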
simonw
In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
TechRemarker
Love how employee portals for many companies essentially never get updated design wise over the decades, lol. That page styling and the balls certainly take me back.
jamiek88
Wow yeah. Flashbacks to when Gmail Invites were cool! Google too.
ed
Same with "See prompt in Google AI Studio" which links to an unpublished prompt in AI Studio.
pseudosavant
I'm really fascinated by the opportunities to analyze videos. The number of tokens a video compresses down to, and what you can reason across those tokens, is incredible.
minimaxir
The actual token calculations for input videos with Gemini 3 Pro are... confusing.
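One rough way to reason about video token counts is a sampled-frames model: some number of frames per second, a fixed token cost per frame, plus an optional per-second audio cost. The sketch below only illustrates that arithmetic; the function name and all parameter values are assumptions and are not taken from the Gemini documentation:

```python
def video_tokens(duration_s: float,
                 frames_per_s: float,
                 tokens_per_frame: int,
                 audio_tokens_per_s: int = 0) -> int:
    """Estimate input tokens for a video under a sampled-frames model.

    Hypothetical model: the video is sampled at `frames_per_s`, each
    sampled frame costs `tokens_per_frame` tokens, and audio adds
    `audio_tokens_per_s` tokens per second. Real services may count
    tokens differently.
    """
    frames = int(duration_s * frames_per_s)
    return frames * tokens_per_frame + int(duration_s * audio_tokens_per_s)


# Placeholder values: a 60-second clip sampled at 1 frame per second.
print(video_tokens(duration_s=60, frames_per_s=1.0,
                   tokens_per_frame=100, audio_tokens_per_s=10))
```

Comparing an estimate like this against the token count the API actually reports is the quickest way to find out which parameters the service really uses.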
devinprater
Audio-described YouTube, please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.
hodder
"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
spchampion2
I actually did this prompt and found that it worked with a single nudge on a followup prompt. My first shot got me a wine glass that was almost full but not quite. I told it I wanted it full to the top - another drop would overflow. The second shot was perfectly full.
RyJones
The correction I expect to give to an intern, not a junior person.
minimaxir
Gemini 3 Pro is not Nano Banana Pro, and the image generation/model that decodes the generated image tokens may not be as robust.
The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.
hodder
As a consumer I typed this into "Gemini". The behind the scenes model selection just adds confusion.
If "AI" trust is the big barrier to widespread adoption of these products, alphabet soup isn't the solution (pun intended).
ed
What’s new here? I believe this is just gemini 3 which was released last month (the model id hasn’t changed AFAICT)
minimaxir
Nothing new, it's just highlighting practical vision use cases.
agentifysh
im realizing how much of a bottleneck vision models are
im just a glorified speedreadin' promptin' QA at this point with codex
once it replaces the QA layer its truly over for software dev jobs
future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"
edit: saw the Screenspot benchmark and holy ** this is an insane jump!!! 11% to 71% even beating Opus 4.5's 50%...chatgpt is at 3.5% and it matches my experience with codex
alex1138
> once it replaces the QA layer its truly over for software dev jobs
Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed. I-promise-not-to-sue-company-if-AI-deletes-the-entire-database
It won't happen for that reason alone. Oh who am I kidding of course it will
siva7
Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something that mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.
minimaxir
That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.
siva7
What's the difference? I thought the vision ai component of gemini 3 is called nano banana?
IanCal
That’s about generating images, the other side is about understanding images.
brokensegue
i assumed nano banana was just a tool that gemini 3 used though i don't know
causal
Okay maybe this one isn't an exaggeration when they say leap forward
Interesting "ScreenSpot Pro" results:
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
https://arxiv.org/abs/2504.07981