Gemini 3 Pro Preview Live in AI Studio
119 comments
· November 18, 2025
ttul
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
valtism
Parakeet TDT v3 would be really good at that
iagooar
What prompt do you use for that?
punnerud
I made a simple webpage to grab text from YouTube videos: https://summynews.com. Great for this kind of testing (I want to expand to other sources in the long run).
gregsadetsky
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
Super impressive!
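For anyone who wants to try reproducing this, here's a minimal sketch using the google-genai Python SDK; the file name is a placeholder and I'm assuming the preview is exposed under the model id gemini-3-pro-preview:

    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    # Upload the meeting recording via the Files API (common audio formats work).
    audio = client.files.upload(file="meeting.mp3")

    prompt = (
        "Analyze this audio recording of a meeting. Produce an executive summary, "
        "then a timestamped transcript labeling every speaker by name."
    )

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model id for the preview
        contents=[audio, prompt],
    )
    print(response.text)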
prodigycorp
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark, which involves type analysis. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make extrapolations about society and what this means for others. Meanwhile, I'm still wondering how they're getting this problem wrong.
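For what it's worth, a private eval doesn't need to be elaborate. A minimal sketch of the kind of harness I mean, with the model call left abstract and the checkers standing in for whatever behavior you actually care about:

    from typing import Callable

    # Each case is a prompt plus a checker that inspects the model's reply.
    # The checkers encode your own expectations, which is the point of keeping
    # the benchmark private and out-of-sample.
    CASES: list[tuple[str, Callable[[str], bool]]] = [
        ("What type does `def f(x: int): return x / 2` return?",
         lambda out: "float" in out.lower()),
        # ... add the cases your daily work keeps tripping models up on
    ]

    def run_eval(call_model: Callable[[str], str]) -> float:
        passed = sum(check(call_model(prompt)) for prompt, check in CASES)
        return passed / len(CASES)

    # call_model is any function that sends a prompt to the model under test
    # and returns its text reply; wire it up to whichever API or CLI you use.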
WhitneyLand
>>benchmarks are meaningless
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
thefourthchime
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini 3 was no better than 2.5.
ofa0e
Your benchmarks should not involve IP.
ComplexSystems
Why? This seems like a reasonable task to benchmark on.
dekhn
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
sosodev
How can you be sure that your benchmark is meaningful and well designed?
Is the only thing that prevents a benchmark from being meaningful publicity?
prodigycorp
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
My benchmark says that I will stick with Codex CLI for the foreseeable future.
gregsadetsky
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
luckydata
I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case.
ddalex
I moved from using the model for Python coding to Golang coding and got incredible speedups in arriving at a correct version of the code.
benterix
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
adastra22
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
Iulioh
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
GPT4/3o might be the best we will ever have
testartr
and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much
it's easy to focus on what they can't do
prodigycorp
Except I'm not nit-picking at some limitation of tokenization, like "how many a's are there in strawberry". If you "understand" the code, you shouldn't be getting it wrong.
Workaccount2
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg, however, is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
lukebechtel
ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.
__jl__
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
raincole
Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.
brianjking
It is so impressive that Anthropic has been able to maintain this pricing still.
Aeolun
Because every time I try to move away I realize there’s nothing equivalent to move to.
fosterfriends
Thrilled to see the cost is competitive with Anthropic.
jhack
With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.
xnx
There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
hirako2000
Google went from the loss-leader phase to bait-and-switch.
They have started the lock-in with AI Studio. I would say they are still in market penetration, but stakeholders want to see a path to profit, so they are starting the price skimming; it's just the beginning.
mupuff1234
I assume the model is just more expensive to run.
hirako2000
Likely. The point is we would never know.
mpeg
Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
santhoshr
Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png
xnx
2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
knownjorbist
Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?
Alex-Programs
Incredible. Thanks for sharing.
robterrell
At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.
notatoad
i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.
mohsen1
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're going for. What does a good SVG of a pelican riding a bicycle actually look like?
AstroBen
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
bn-l
It’s a good pelican. Not great but good.
golfer
Supposedly this is the model card. Very impressive results.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
tweakimp
Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is there just an improvement in that some tests are solved in a better way, or is this a breakthrough and this model can do something that all others cannot?
rvnx
This is a list of questions and answers that was created by different people.
The questions AND the answers are public.
If the LLM manages, through reasoning OR memory, to repeat back the answer, then it wins.
The scores represent the % of correct answers they recalled.
stavros
I estimate another 7 months before models start getting 115% on Humanity's Last Exam.
HardCodedBias
If you believe another thread, the benchmarks are comparing Gemini 3 (probably thinking) to GPT-5.1 without thinking.
The person also claims that with thinking on the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
iamdelirium
It's easily shown that the numbers are for GPT-5.1 thinking (high).
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
sd9
How long does it typically take after this to become available on https://gemini.google.com/app ?
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
mpeg
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
sd9
Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.
magicalhippo
> https://gemini.google.com/app
How come I can't even see prices without logging in... are they doing regional pricing?
Squarex
Today, I guess. They were not releasing preview models this time, and it seems they want to synchronize the release.
csomar
It's already available. I asked it "how smart are you really?" and it gave me the same AI garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...
mil22
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
sarreph
Looks to be available in Vertex.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
CjHuber
For me it's up and running. I was doing some work in AI Studio when it was released and have already rerun a few prompts. It's also interesting that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
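For reference, a minimal sketch of how the 2.5-era thinking budget is set through the google-genai Python SDK (whether the new low/high level is exposed as a similar config field is something I haven't checked):

    from google import genai
    from google.genai import types

    client = genai.Client()

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Prove that the sum of two odd integers is even.",
        config=types.GenerateContentConfig(
            # 2.5-style knob: an upper bound on thought tokens, not a floor,
            # which may be why raising it never seemed to make it think more.
            thinking_config=types.ThinkingConfig(thinking_budget=8192),
        ),
    )
    print(response.text)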
lousken
I hope some users will switch from cerebras to free up those resources
r0fl
Works for me.
misiti3780
seeing the same issue.
sottol
you can bring your google api key to try it out, and google used to give $300 free when signing up for billing and creating a key.
when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".
i haven't thrown a difficult problem at gemini 3 pro yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better, one was always more succinct and i liked its "style" but they usually offered about the same solution.
nickandbro
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
Thev00d00
That is pretty impressive.
So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.
burkaman
Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
jmmcd
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
ddalex
https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari
rixed
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
bitshiftfaced
It hadn't occurred to me until now that the pelican could overcome the short-legs issue by not sitting on the seat and instead putting its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
xnx
Very aero
GodelNumbering
And of course they hiked the API prices
Standard Context (≤ 200K tokens)
Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)
Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)
Long Context (> 200K tokens)
Input $4.00 vs $2.50 (same +60%)
Output $18.00 vs $15.00 (same +20%)
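To put the increase in per-request terms, a rough back-of-the-envelope in Python (the request size is made up for illustration):

    # Hypothetical request: 100K input tokens, 10K output tokens, standard context.
    IN_TOK, OUT_TOK = 100_000, 10_000

    def cost(in_price_per_m, out_price_per_m):
        return IN_TOK / 1e6 * in_price_per_m + OUT_TOK / 1e6 * out_price_per_m

    print(f"Gemini 2.5 Pro: ${cost(1.25, 10.00):.3f}")  # $0.225
    print(f"Gemini 3 Pro:   ${cost(2.00, 12.00):.3f}")  # $0.320

So roughly 42% more per request at that input/output mix.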
panarky
Claude Opus is $15 input, $75 output.
NullCascade
I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.
Can the mathematicians here try asking it to invent novel new math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.