
Gemini 3 Pro Preview Live in AI Studio

119 comments · November 18, 2025

ttul

My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.

valtism

Parakeet TDT v3 would be really good at that

iagooar

What prompt do you use for that?

punnerud

I made a simple webpage to grab text from YouTube videos (https://summynews.com). It might be great for this kind of testing; I want to expand to other sources in the long run.

gregsadetsky

I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.

3 created a great "Executive Summary", identified the speakers' names, and then gave me a second-by-second transcript:

    [00:00] Greg: Hello.
    [00:01] X: You great?
    [00:02] Greg: Hi.
    [00:03] X: I'm X.
    [00:04] Y: I'm Y.
    ...
Super impressive!
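
(A minimal sketch of this kind of test, assuming the google-genai Python SDK; the model id, file name, and exact upload parameter are assumptions and may differ from the current API.)

    # Minimal sketch of an audio summary/transcript test with the
    # google-genai Python SDK (pip install google-genai). Model id,
    # file name, and prompt wording are assumptions.
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    # Upload the meeting recording so the model can reference it directly.
    audio = client.files.upload(file="management_meeting.mp3")

    prompt = (
        "Analyze this audio recording of a management meeting. Produce "
        "concise notes along with a transcript that labels all the speakers."
    )

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed preview model id
        contents=[prompt, audio],
    )
    print(response.text)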

prodigycorp

I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark, which involves type analysis. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).

For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).

This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society and what this means for others. Meanwhile... I'm still wondering how they're still getting this problem wrong.
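
(The actual benchmark is kept private; below is a purely hypothetical illustration of the general category, Python type analysis, not the benchmark itself.)

    # Hypothetical example of the category only (Python type analysis),
    # NOT the commenter's actual benchmark. A typical question: what type
    # does a type checker infer for `result`, and does the call type-check?
    from typing import Callable, TypeVar

    T = TypeVar("T")

    def apply_twice(f: Callable[[T], T], x: T) -> T:
        return f(f(x))

    # str.upper is Callable[[str], str], so T is solved as str and `result`
    # is inferred as str; passing an int for x here would be a type error.
    result = apply_twice(str.upper, "hello")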

WhitneyLand

>>benchmarks are meaningless

No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.

>>my fairly basic python benchmark

I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.

thefourthchime

I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini 3 was no better than with 2.5.

ofa0e

Your benchmarks should not involve IP.

ComplexSystems

Why? This seems like a reasonable task to benchmark on.

dekhn

Using a single custom benchmark as a metric seems pretty unreliable to me.

Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

sosodev

How can you be sure that your benchmark is meaningful and well designed?

Is the only thing that prevents a benchmark from being meaningful publicity?

prodigycorp

I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar as it encapsulates my coding preferences and communication style, it's the proper benchmark for me.

My benchmark says that I will stick with Codex CLI for the foreseeable future.

gregsadetsky

I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?

I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?

Thanks

[0] https://news.ycombinator.com/item?id=45968665

luckydata

I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case.

ddalex

I moved from using the model for Python coding to Golang coding and saw incredible speedups in getting to a correct version of the code.

benterix

> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.

Yeah, I have my own set of tests and the results are a bit unsettling, in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it is now.

adastra22

I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.

Iulioh

A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....

GPT4/3o might be the best we will ever have

testartr

And models are still pretty bad at playing tic-tac-toe; they can do it, but they think way too much.

It's easy to focus on what they can't do.

prodigycorp

Except I'm not nitpicking at some limitation of tokenization, like "how many a's are there in strawberry". If you "understand" the code, you shouldn't be getting it wrong.

Workaccount2

It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]... please count the legs), which so far every other model has failed agonizingly, even when I tell them they are failing; they tend to fight back at me.

Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well endowed. The 5th leg is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.

Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.

lukebechtel

ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.

__jl__

API pricing is up to $2/M for input and $12/M for output

For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.

raincole

Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.

brianjking

It is so impressive that Anthropic has been able to maintain this pricing still.

Aeolun

Because every time I try to move away I realize there’s nothing equivalent to move to.

fosterfriends

Thrilled to see the cost is competitive with Anthropic.

jhack

With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.

xnx

There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...

hirako2000

Google went from loss leader to bait-and-switch.

They have started the lock-in with AI Studio. I would say they are still in market penetration, but stakeholders want to see a path to profit, so the price skimming is starting; it's just the beginning.

mupuff1234

I assume the model is just more expensive to run.

hirako2000

Likely. The point is we would never know.

mpeg

Well, it just found a bug in one shot that Gemini 2.5 and GPT-5 failed to find in relatively long sessions. Claude 4.5 had found it, but not in one shot.

Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)

santhoshr

Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png

xnx

2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...

knownjorbist

Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?

Alex-Programs

Incredible. Thanks for sharing.

robterrell

At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.

notatoad

i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.

mohsen1

Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're going for. What does a good pelican-riding-a-bicycle SVG actually look like?

AstroBen

IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?

bn-l

It’s a good pelican. Not great but good.

golfer

tweakimp

Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an improvement, in that some tests are solved in a better way, or is this a breakthrough and this model can do something that all the others cannot?

rvnx

This is a list of questions and answers that was created by different people.

The questions AND the answers are public.

If the LLM manages, through reasoning OR memory, to repeat back the answer, then it wins.

The scores represent the % of correct answers it recalled.
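
(A toy version of the scoring described above, assuming exact-match grading.)

    # Toy scorer for the setup described above: public question/answer pairs,
    # score = percentage of reference answers the model reproduces.
    def score(model_answers: dict[str, str], reference: dict[str, str]) -> float:
        correct = sum(
            1 for q, a in reference.items()
            if model_answers.get(q, "").strip().lower() == a.strip().lower()
        )
        return 100.0 * correct / len(reference)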

stavros

I estimate another 7 months before models start getting 115% on Humanity's Last Exam.

HardCodedBias

If you believe another thread, the benchmarks compare Gemini 3 (probably thinking) to GPT-5.1 without thinking.

The person also claims that with thinking on the gap narrows considerably.

We'll probably have 3rd party benchmarks in a couple of days.

iamdelirium

It's easily shown that the numbers are for GPT-5.1 Thinking (high).

Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard

sd9

How long does it typically take after this to become available on https://gemini.google.com/app ?

I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

mpeg

Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code

Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.

sd9

Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.

magicalhippo

> https://gemini.google.com/app

How come I can't even see prices without logging in... they doing regional pricing?

Squarex

Today, I guess. They were not releasing preview models this time, and it seems they want to synchronize the release.

csomar

It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...

mil22

It's available to be selected, but the quota does not seem to have been enabled just yet.

"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

"You've reached your rate limit. Please try again later."

Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.

sarreph

Looks to be available in Vertex.

I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.

CjHuber

For me it's up and running. I was doing some work in AI Studio when it was released and have already rerun a few prompts. It's also interesting that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.

lousken

I hope some users will switch from cerebras to free up those resources

r0fl

Works for me.

misiti3780

seeing the same issue.

sottol

You can bring your own Google API key to try it out, and Google used to give $300 of free credit when signing up for billing and creating a key.

When I signed up for billing via the Cloud Console and entered my credit card, I got $300 in "free credits".

I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution.

nickandbro

What we have all been waiting for:

"Create me a SVG of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/FfhmhTK1

Thev00d00

That is pretty impressive.

So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.

burkaman

Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

jmmcd

"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.

rixed

I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.


bitshiftfaced

It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.

xnx

Very aero

GodelNumbering

And of course they hiked the API prices

Standard Context (≤ 200K tokens)

Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)

Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)

Long Context (> 200K tokens)

Input $4.00 vs $2.50 (same +60%)

Output $18.00 vs $15.00 (same +20%)
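
(Back-of-the-envelope comparison at the standard-context prices above; the example token counts are made up.)

    # Cost comparison at the standard-context prices listed above
    # (USD per 1M tokens); the example workload is made up.
    PRICES = {
        "gemini-2.5-pro": (1.25, 10.00),   # (input, output)
        "gemini-3-pro":   (2.00, 12.00),
    }

    input_tokens, output_tokens = 50_000, 5_000  # one largish request

    for model, (p_in, p_out) in PRICES.items():
        cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
        print(f"{model}: ${cost:.4f}")
    # gemini-2.5-pro: $0.1125
    # gemini-3-pro:   $0.1600  (~42% more for this input-heavy mix)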

panarky

Claude Opus is $15 input, $75 output.

CjHuber

Is it the first time long context has separate pricing? I hadn’t encountered that yet

1ucky

Anthropic is also doing this for long context (>= 200K tokens) on Sonnet 4.5.

Topfi

Google has been doing that for a while.

brianjking

Google has always done this.

CjHuber

Ok wow then I‘ve always overlooked that.

NullCascade

I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.

Can the mathematicians here try asking it to invent novel math related to [Insert your field of specialization] and see if it comes up with something new and useful?

Try lowering the temperature, using SymPy, etc.
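
(A sketch of that workflow: have the model state an identity or closed form, then check it mechanically with SymPy rather than trusting the prose; the identity below is just a stand-in.)

    # Check a model-proposed identity with SymPy instead of trusting the prose.
    # The claim below (sum_{k=0}^{n} k = n(n+1)/2) is just a stand-in example.
    import sympy as sp

    n, k = sp.symbols("n k", integer=True, nonnegative=True)

    claimed = n * (n + 1) / 2
    computed = sp.summation(k, (k, 0, n))

    print(sp.simplify(computed - claimed) == 0)  # True -> the claim holds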