
Gemini 2.5

503 comments · March 25, 2025

og_kalu

One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give them any details of your story, because they get absolutely neurotic with it.

Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meeting them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.

This is the first model that, after 19 pages generated so far, resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.

Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

with pastebin - https://pastebin.com/aiWuYcrF

comboy

I like how critique of LLMs evolved on this site over the last few years.

We are currently at nonsensical pacing while writing novels.

skyechurch

The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.

kaliqt

Oh, that's a good one. And it's true. There seems to be a massive inability among most people to admit the growing impact of modern AI development on society.

josefx

They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.

orena

I asked Claude to explain what you meant... https://claude.ai/share/391160c5-d74d-47e9-a963-0c19a9c7489a

qnleigh

This is so on-point. Many things that we now take for granted from LLMs would have been considered sufficient evidence for AGI not all that long ago. Likely the only test of AGI is whether we can still come up with new goalposts.

Nition

Haha, so that's the first derivative of goalpost position. You could take the derivative of that to see if the rate of change is speeding up or slowing down.

munksbeer

I love this comment.

solardev

It's not really passing the Turing Test until it outsells Harry Potter.

dragonwriter

> It's not really passing the Turing Test until it outsells Harry Potter.

Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.

silveraxe93

From Gary Marcus' (notable AI skeptic) predictions of what AI won't do in 2027:

> With little or no human involvement, write Pulitzer-caliber books, fiction and non-fiction.

So, yeah. I know you made a joke, but you have the same issue as the Onion I guess.

tummler

Let me toss a grenade in here.

What if we didn't measure success by sales, but by impact on the industry (or society), or value to people's lives?

Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?

ninetyninenine

The goalposts will be moved again. Tons of people clamoring that the book is stupid and vapid and that only idiots bought it. When AI starts taking over jobs, which it already has, you'll get tons of idiots claiming the same thing.

eru

Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not surpassing humans.

Of course, this is just some pedantry.

I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.

jychang

To be fair, pacing as a big flaw of LLMs has been a constant complaint from writers for a long time.

There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...

krzat

This either ends at "better than 50% of human novels" garbage or at unimaginably compelling works of art that completely obsolete fiction writing.

Not sure which is better for humanity in the long term.

WindyMiller

That could only obsolete fiction-writing if you take a very narrow, essentially commercial view of what fiction-writing is for.

I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.

ruraljuror

We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow a standard for LLM critique?

jorl17

I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.

I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.

I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.

rafaelmn

People are trying to use gen AI in more and more use cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching the boundaries of being useful. And that is not an attempt to make the gen AI tech look bad - it is really amazing what it can do - but it is far from delivering on the hype, and that is why people are providing critical evaluations.

Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance was laughable on real tasks.

parineum

> Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance was laughable on real tasks.

That's a better criticism of college exams than of the benchmarks; those exams likely have either the exact questions or very similar ones in the training data.

The list of things that LLMs do better than the average human tends to rest squarely in the "problems already solved by above average humans" realm.

stickfu

I don't know why I keep subjecting myself to Hacker News, but every few months I get the itch, and it only takes a few minutes to be turned off by the cynicism. I get that it's from potentially wizened tech heads who have been in the trenches and are being realistic. It's great for that, but any new bright-eyed and bushy-tailed dev/techie, whatever, should stay far away until much later in their journey.

ksec

Do we have any simple benchmarks (and I know benchmarks are not everything) that test all the LLMs?

The pace is moving so fast I simply can't keep up. Or is there an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?

basch

It's more a bellwether or symptom of a flaw where the context becomes poisoned and the model continually regurgitates the same thought over and over.

deng

I have actually read it and agree it is impressive. I will not comment much on the style of the writing, since this is very much subjective, but I would rate it as the "typical" modern fantasy style, which aims at filling as many pages as possible: very "flowery" language, lots of adjectives/adverbs, lots of details, lots of high-school prose ("Panic was a luxury they couldn't afford"). Not a big fan of that, since I really miss the time when authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it. If someone gave me this, I would advise them to get a good copy editor.

There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which obviously is impossible when you look up through a trapdoor, not to mention that previously it was said this entry is behind the bar counter, surely blocking the sight. Kael lights an oily rag which somehow becomes a torch. There are more generic things, like these Eldertides somehow being mythical things no one has ever seen, yet they seem to be pretty common occurrences? The dimensions of the cellar are completely unclear; at first it seems to be very small, yet they move around it quite a bit. There are other issues, like people using the same words as the narrator ("the ooze"), as if they were listening to him. The innkeeper suddenly calls Kael by his name as if they already know each other.

Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.

hjnilsson

Thank you for taking the time to do a thorough read. I just skimmed it, and the prose is certainly not for me. To me it lacks focus, but as you say, this may be the style the readers enjoy.

And it also, as you say, really reuses words. Just from reading I notice "phosphorescence" 4 times in this chapter, for example, and "ooze" 17 times (!).

It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.

blinding-streak

Regarding your last sentence, I agree. My stance is this: If you didn't bother to write it, why should I bother to read it?

deng

From a technical standpoint, this is incredible. A few years ago, computers had problems creating grammatically correct sentences. Producing a consistent narrative like this was science fiction.

From an artistic standpoint, the result is... I'd say: incredibly mediocre, with some glaring errors in between. This does not mean that an average person could produce a similar chapter. Gemini can clearly produce better prose than the vast majority of people. However, the vast majority of people do not publish books. Gemini would have to be on par with the best professional writers, and it clearly isn't. Why would you read this when there is no shortage of great books out there? It's the same with music, movies, paintings, etc. There is more great art than you could ever consume in your lifetime. All LLMs/GenAI do in art is pollute everything with their incredible mediocrity. For art (and artists), these are sad times.

meta_ai_x

It's more nuanced than that. There is certain material/content that it is mandatory/necessary to read.

Ideally I'd prefer to read material written by the top 1%ile expert in that field, but due to constraints you almost always get to read material written by a midwit, intern, or junior associate. In which case AI-written content is much better, especially as I can interrogate the material and match the top 1%ile quality.

og_kalu

Quality is its own property, separate from its creator. If a machine writing something bothers you irrespective of quality, then don't read it. You think I would care? I would not.

If this ever gets good enough to write your next bestseller or award winner, I might not even share it, and if I did, I wouldn't care whether some stranger read it or not, because it was created entirely for my pleasure.

og_kalu

Yeah I just focused on how well it was paced and didn't give any instructions on style or try a second pass to spot any inconsistencies.

That would be the next step but I'd previously never thought going any further might be worth it.

KittenInABox

> Not a big fan of that since I really miss the time where authors could write single, self-contained books instead of a sprawling series over thousands of pages, but I know of course that this kind of thing is very successful and people seem to enjoy it.

When was this time you speak of?

nout

Using the AI in multiple phases is an approach that can handle this. Similar to the "Deep Research" approach, you can tell it to first generate a storyline with multiple twists and turns. Then ask the model to take this storyline and generate prompts for individual chapters. Then ask it to generate the individual chapters based on those prompts, etc.
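
A minimal sketch of that phased approach in Python, assuming a hypothetical generate(prompt) helper that wraps whatever LLM API you're using (the phase structure is the point, not the exact prompts):

    def generate(prompt: str) -> str:
        # Hypothetical helper: plug in your LLM client (Gemini, OpenAI, etc.) here.
        raise NotImplementedError

    # Phase 1: high-level storyline with twists and turns.
    storyline = generate("Write a one-page storyline for an epic fantasy novel "
                         "with three major twists and a slow-burn romance subplot.")

    # Phase 2: turn the storyline into one brief per chapter.
    chapter_briefs = generate(
        f"Given this storyline:\n{storyline}\n"
        "Produce a numbered list of 20 chapter briefs. For each, state the events "
        "covered and what must NOT yet happen, to protect the pacing."
    ).splitlines()

    # Phase 3: expand each brief into prose, carrying a running summary for continuity.
    summary, chapters = "", []
    for brief in chapter_briefs:
        chapter = generate(f"Story so far (summary):\n{summary}\n"
                           f"Write the next chapter from this brief:\n{brief}")
        chapters.append(chapter)
        summary = generate(f"Fold this chapter:\n{chapter}\ninto the summary:\n{summary}")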

bbor

Yup -- asking a chatbot to create a novel in one shot is very similar to asking a human to improvise a novel in one shot.

mikepurvis

But a future chatbot would be able to internally project-manage itself through that process: first emitting an outline, then producing draft chapters, then going back and critiquing itself, and finally rewriting the whole thing.

og_kalu

It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you tried breaking it down, which I have, you'd run into the same problem unless you tried holding its hand for every single page, and then, what's the point? I want to read the story, not co-author it.

fshr

I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.

og_kalu

I don't disagree. Like with almost anything else involving LLMs, getting hands-on produces better results, but because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.

saberience

Random question: if you don't care about being a creator yourself, why do you even want to read long-form writing written by an LLM? There are literally tens of thousands of actual human-written books out there, all of them better than anything an LLM can write, so why not read them?


tluyben2

That was what I tried on the train [0] a few weeks ago. I used Groq to get something very fast, to see if it would work at least somewhat. It gives you a PDF in the end. Plugging in a better model gave much better results (still not really readable if you actually try; at a glance it's convincing though), however it was so slow that testing was kind of impossible. You cannot really have things done in parallel either, because it does need to know what it pushed out before, at least a summary of it.

[0] https://github.com/tluyben/bad-writer

sagarpatil

My prompt is nowhere near yours.

Just for fun: I asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer; the rewrite is hilarious lol.

https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...

didip

Give it time; this will be solved.

I envision that one day a framework will be created that can persist an LLM's current state to disk, so that "fragments of memories" can be paged in and out of memory.

When that happens, LLMs will be able to remember everything.

smcleod

I have never used an LLM for fictional writing, but I have been writing large amounts of code with them for years. What I'd recommend: when you're defining your plan up front for the sections of the content, simply state in which phase/chapter of the content they should meet.

Planning generated content is often more important to invest in than the writing of it.

Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For such a large output as you're expecting, I'd expect a considerable prompt of rules and context setting (maybe a page or two).
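
For example, a skeleton of the kind of structure meant here (just a sketch; the tag names are arbitrary):

    <role>
    You are a fantasy novelist. You write one chapter at a time and never run ahead of the outline.
    </role>

    <pacing_rules>
    - The love interest is not introduced before chapter 8.
    - Each chapter covers at most one day of story time.
    </pacing_rules>

    <characters_and_world>
    ...profiles, setting notes, tone guidelines...
    </characters_and_world>

    <task>
    Write chapter 1 only, roughly 3,000 words, following the outline below.
    </task>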

pantsforbirds

I had Grok summarize + evaluate the first chapter with thinking mode enabled. The output was actually pretty solid: https://pastebin.com/pLjHJF8E.

I wouldn't be surprised if someone figured out a solid mixture of models working as a writer (team of writers?) + editor(s) and managed to generate a full book from it.

Maybe some mixture of general outlining + maintaining a wiki with a basic writing and editing flow would be enough. I think you could probably find a way to maintain plot consistency, but I'm not so sure about maintaining writing style.

malisper

I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.

Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

For those curious, the riddle is: There are three people in a circle. Each person has a positive integer floating above their head, such that each person can see the other two numbers but not their own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
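
For anyone who wants to check an answer without looking it up, a brute-force epistemic simulation is easy to write. This is my own sketch (Python), not from the thread; it just encodes "a person knows their number iff exactly one candidate is consistent with all earlier announcements":

    from math import prod

    def valid(t):
        # One of the three positive integers must be the sum of the other two.
        a, b, c = t
        return min(t) > 0 and (a == b + c or b == a + c or c == a + b)

    def candidates(t, i):
        # Values person i could have, given the two numbers they can see.
        x, y = [t[j] for j in range(3) if j != i]
        return {v for v in (x + y, abs(x - y)) if v > 0}

    def knows(t, i, history):
        # Person i knows their number iff exactly one candidate fits the history.
        fits = [v for v in candidates(t, i)
                if consistent(tuple(v if j == i else t[j] for j in range(3)), history)]
        return len(fits) == 1

    def consistent(t, history):
        # Triple t must be valid and agree with every announcement made so far.
        return valid(t) and all(knows(t, p, history[:k]) == knew
                                for k, (p, knew) in enumerate(history))

    # Persons 1, 2 and 3 each say "I don't know"; then person 1 announces 65.
    history = [(0, False), (1, False), (2, False)]
    products = {prod(t)
                for b in range(1, 200) for c in range(1, 200)
                for t in [(65, b, c)]
                if consistent(t, history) and knows(t, 0, history)}
    print(products)  # should print a single product, matching the answer quoted downthread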

hmottestad

This looks like it was posted on Reddit 10 years ago:

https://www.reddit.com/r/math/comments/32m611/logic_question...

So it’s likely that it’s part of the training data by now.

canucker2016

You'd think so, but both Google's AI Overview and Bing's CoPilot output wrong answers.

Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

Whoa. Math is not AI's strong suit...

Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

Hats???

Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.

moritzwarhier

That's a non sequitur; they would be stupid to run an expensive LLM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.

vicek22

The riddle has different variants with hats: https://erdos.sdslabs.co/problems/5

Etherlord87

There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.

- This kid is a genius! - you yell

- But wait, the kid just picked an answer up from the ground; it didn't actually come up with it...

- But the other toddlers could have done it too and didn't!


malisper

Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.

lolinder

I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.

There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:

* Random chance (these are still statistical machines after all)

* The problem resurfaced recently and shows up more often than it used to.

* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.

mattkevan

I think there’s a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.

Recently these have stopped, and now the ads are about becoming a maths tutor for AI.

Doesn’t seem like a role with long-term prospects.

7e

Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving (the "answer") it is online, and the model has surely seen it.

stabbles

It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.

toonalfrink

This whole answer hinges on knowing that 0 is not a positive integer; that's why I couldn't figure it out...

f1shy

Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that LLMs can reason. They just regurgitate the input.

jug

Have you given a reasoning model a novel problem and watched its chain of thought process?

Etherlord87

I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.

motoxpro

I would say that 99.99% of humans do the same. Most people never come up with anything novel.

drexlspivey

And if it wasn’t, it is now

thaumasiotes

[flagged]

thaumasiotes

Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?

_cs2017_

This is solvable in roughly half an hour, on pen and paper, by a random person I picked with no special math skills (beyond a university education). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.

Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.

This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).

windowshopping

The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.

sundarurfriend

I agree with your general point that this "random person" is probably not representative of anything close to an average person off the street, but I think the phrasing "very very intelligent" and "exceedingly clever" is kinda misleading.

In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't, has more to do with persistence and ability to maintain focus, rather than "intelligence" in terms of problem-solving ability per se. I've worked with college students helping them learn to solve these kinds of problems (eg. as part of pre-interview test prep), and in most cases, those who solve it and those who don't have the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.

I mention this because this frustration itself comes from a belief that the ability to solve these belongs some "exceedingly clever" people only, and not someone like them. So, this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.

dskloet

I solved it in less than 15 minutes while walking my dog, no pen or paper. But I wouldn't claim to be a random person without math skills. And my very first guess was correct.

It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.

wrasee

So in the three hours between you reading the puzzle in the parent comment, you stopped what you were doing, managed to get some other "random" person to stop what they were doing and spend half an hour of their time on a maths puzzle that at that point prior experience suggested could take a day? All within three hours?

That's not to say that you didn't, or that you aren't recalling a previous time that happens to involve this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might see that as not entirely credible.

Best guess: this random person is someone that really likes puzzles, is presumably good at them and is very, very far from being representative to the extent you would require to be in support of your argument.

Read: just a heavy flex about puzzle solving.

re-thc

> This is solvable in roughly half an hour on pen and paper by a random person I picked with no special math skills (beyond a university).

I randomly answered this post and can't solve it in half an hour. Is the point leetcode but for AI? I'd rather it solve real problems than "elite problems".

Side note: I couldn't even find pen and paper around in half an hour.

sebzim4500

This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.

Workaccount2

What makes this interesting is that while the question is online (on Reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.

Granted this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs):

65×26×39 = 65910

sebzim4500

> Gemini also shows its work, and it seems to do a few orders of magnitude more calculating than the elegant answer given on Reddit.

I don't think Gemini does an unnecessary amount of computation; it's just more verbose. This is typical of reasoning models: almost every step is necessary, but many would not be written down by a human.

varispeed

Seems like we might need a section of internet that is off limits to robots.

Centigonal

Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers are brutal and do not respect robots.txt.

baq

It’s here and it’s called discord.

kylebenzle

Or we could just accept that LLMs can only output what we have put in, and that calling them "AI" was a misnomer from day one.

beefnugs

Why is this a great riddle? It sounds like incomplete nonsense to me:

It doesn't say anything about the skill levels of the participants, whether their answers are just guessing, or why they aren't just guessing the sum of the other two numbers each time they're asked to provide more information.

It doesn't say the guy saying 65 is even correct.

How could three statements of "no new information" give information to the first guy, who didn't know the first time he was asked?

DangitBobby

2 and 3 saying they don't know eliminates some uncertainties 1 had about their own number (any combination where the other two would see numbers that could tell them their own). After those possibilities were eliminated, the 1st person has narrowed it down enough to actually know based on the numbers shown above the other 2. The puzzle could instead have been done in order 2, 3, 1 and 1 would not have needed to go twice.

I guess really the only missing information is that they have the exact same information you do, plus the numbers above their friends heads.


yifanl

You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving a number in the training set is very easy.

i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26 and the third person is standing underneath a 39. They can see the others' numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number; they respond that they don't know. You ask the second person for their number; they respond that they don't know. You ask the third person; they respond that they don't know. You ask the first person again and they respond with the correct value. How did they know?

And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.


semiinfinitely

I love how the entire comment section is getting one-shotted by your math riddle instead of the original post topic.

refulgentis

In general I find commentary here too negative on AI, but I'm a bit squeamish about maximalist claims re: AI mathematical reasoning vs. the human population based on this, even setting aside lottery-ticket-hypothesis-like concerns.

It's a common logic puzzle. Google can't turn up an exact match to the wording you have, but see e.g. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/

utopcell

Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.

[1] https://en.wikipedia.org/wiki/100_prisoners_problem
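
For reference, the cycle-following strategy for that problem is easy to simulate; here's a minimal Python sketch (the textbook strategy, not anything specific to the comment above):

    import random

    def all_prisoners_succeed(n=100, tries=50):
        # Loop-following strategy: prisoner i opens box i, then the box whose
        # number is on the slip inside, and so on, for at most `tries` boxes.
        # Everyone succeeds iff the random permutation has no cycle longer than `tries`.
        slips = list(range(n))
        random.shuffle(slips)
        for prisoner in range(n):
            box = prisoner
            for _ in range(tries):
                if slips[box] == prisoner:
                    break
                box = slips[box]
            else:
                return False  # this prisoner never found their own number
        return True

    trials = 10_000
    wins = sum(all_prisoners_succeed() for _ in range(trials))
    print(wins / trials)  # roughly 0.31, versus ~(1/2)**100 if everyone guessed at random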

r0fl

Wow

Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.

I haven’t seen a question loop this long ever.

Very impressed

z2

Deepseek R1 got the right answer after a whopping ~10 minutes of thinking. I'm impressed and feel kind of dirty, I suspect my electricity use from this could have been put to better use baking a frozen pizza.

deepboy2

Just tried it on Deepseek (not R1, maybe V3-0324) and got the correct answer after 7-8 pages of reasoning. Incredible!

simonw

I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.

Plus it drew me a very decent pelican riding a bicycle.

Notes here: https://simonwillison.net/2025/Mar/25/gemini/

jillesvangurp

Have you considered that they must be training on images of pelicans riding bicycles at this point ;-). At least given how often that comes up in your reviews, a smart LLM engineer might put their fingers on the scales a bit and optimize for those things that come up in reviews of their work a lot.

redox99

Claude's pelican is way better than Gemini's

jonomacd

I'm not so sure. I've run it a bunch of times. It makes a great pelican.

Personally I'm convinced this model is the best out there right now.

https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...

fao_

I think a competent 5yro could make a better pelican on a bicycle than that. Which to me feels like the hallmark of AI.

I mean, hell, I have drawings from when I was eight of leaves and they are botanically-accurate enough to still be used for plant identification, which itself is a very difficult task that people study decades for. I don't see why this is interesting or noteworthy, call me a neo-luddite if you must.

ggeorgovassilis

I've been following your blog for a while now, great stuff!

kridsdale3

I just tried your trademark benchmark on the new 4o Image Output, though it's not the same test:

https://imgur.com/a/xuPn8Yq

jonomacd

And the same thing with gemini 2.0 flash native image output.

https://imgur.com/a/V4YAkX5

It's sort of irrelevant though as the test is about SVGs.

Unroasted6154

Was that an actual SVG?

simonw

No that's GPT-4o native image output.

freediver

Tops our benchmark in an unprecedented way.

https://help.kagi.com/kagi/ai/llm-benchmark.html

High quality, to the point. Bit on the slow side. Indeed a very strong model.

Google is back in the game big time.

aoeusnth1

It should be in the "reasoning" category, right? (still topping the charts there)

causal

Remarkable how few tokens it needed to get a much better score than other reasoning models. Any chance of contamination?

85392_school

It makes me wonder how the token counting was implemented and if it missed the (not sent in API) reasoning.

freediver

Valid concern; most likely thinking tokens were not counted due to API reporting changes.

utopcell

That is some wide gap!

anotherpaulg

Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.

This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.

[0] https://aider.chat/docs/leaderboards/

aoeusnth1

Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?

In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.

anotherpaulg

It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.

Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.

aoeusnth1

Ok, thanks for clearing that up.

sagarpatil

The only benchmark I care about. Thanks!

Oras

These announcements have started to look like a template.

- Our state-of-the-art model.

- Benchmarks comparing to X,Y,Z.

- "Better" reasoning.

It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

devsda

Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have just been incremental updates over the previous gen, and only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.

It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.

breppp

I think the greatest issue with buying a new phone today is ironically the seamless migration.

Once you get all your apps, wallpaper, shortcut order and the same OS, you really quickly get the feeling you spent $1,000 for the exact same thing.

atonse

100% agree with you.

But it needs to be seamless to remove any friction from the purchase; at the same time, if it feels the same, then we feel like we wasted money.

So what I usually do is buy a different colored phone and change the wallpaper.

My MacBook was the same. Seamless transition, and 2 hours later I was used to the new M4 speeds.

flakiness

Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones even in the iPhone's heyday.

These LLM models were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to be taking less time.

If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.

tibbar

For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.

So, I think these results are very interesting, if you know what features specifically you are using.

zwaps

But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. In Nolima or Babilong we see that Gemini models still can't do long context.

Excited to see if it works this time.

bhouston

> It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.

They are all interchangeable. This is great for users as it adds to price pressure.

flir

Man, I hope those benchmarks actually measure something.

Legend2440

I would say they are a fairly good measure of how well the model has integrated information from pretraining.

They are not so good at measuring reasoning, out-of-domain performance, or creativity.

Workaccount2

Sooner or later someone is going to find "secret sauce" that provides a step-up in capability, and it will be closely guarded by whoever finds it.

As big players look to start monetizing, they are going to desperately be searching for moats.

bangaladore

Reasoning was supposed to be that for "Open" AI; that's why they go to such lengths to hide the reasoning output. Look how that turned out.

Right now, in my opinion, OpenAI actually has a useful deep research feature, which I've found nobody else matches. But there is no moat to be seen there.

cratermoon

Sooner or later someone is going to find the "secret sauce" that allows building a stepladder tall enough to reach the moon.

It's called the "first step fallacy", and AI hype believers continue to fall for it.

cadamsdotcom

Why not snooze the news for a year and see what’s been invented when you get back. That’ll blow your mind properly. Because each of these incremental announcements contributes to a mind blowing rate of improvement.

The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.

The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.

diego_sandoval

I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence.

If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.

gtirloni

The improvements have been marginal at best. I wouldn't call that beating.

ototot

Maybe they just asked Gemini 2.5 to write the announcement.

cpeterso

And it was trained on the previous announcements.

xlbuttplug2

... which were also written by earlier Gemini versions.

schainks

I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.

Also, I think google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.

throwaway2037

    > I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".

schainks

> can you share a concrete example?

Upload a complicated PDF of a presentation and ask for insights that require some critical thinking about it.

> Do you think these "moved entire workflows" offset a full worker, or X% of a full worker

It can replace many junior analysts IMO.

cratermoon

Glaringly missing from the announcements: concrete use cases and products.

The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.

throwaway2037

I would say Adobe is doing an excellent job of commercialising image manipulation and generation using LLMs. When I see adverts for their new features, they seem genuinely useful for normie users who are trying to edit some family/holiday photos.

kiratp

https://www.osmos.io/fabric

Practical, real-world application.

sebzim4500

ChatGPT has like 500M weekly active users, what are you on about?

cratermoon

"Well, Ed, there are 300 million weekly users of ChatGPT. That surely proves that this is a very real industry!" https://www.wheresyoured.at/longcon/


greatgib

If you plan to use Gemini, be warned, here are the usual Big Tech dragons:

   Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:

   How human reviewers improve Google AI

   To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.

cavisne

Google is the best of these. You either pay per token and there is no training on your inputs, or it’s free/a small monthly fee and there is training.

greatgib

And even worse:

   Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!

kccqzy

Well they can't delete a user's Gemini conversations because they don't know which user a particular conversation comes from.

This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.

mastodon_acc

How does it compare to OpenAI and anthropic’s user data retention policy?

greatgib

If I'm not wrong, ChatGPT states clearly that they don't use user data anymore by default.

Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service saying that your data can be fed to human reviewers at will.

KoolKat23

I don't think this is the same as the AI Studio and API terms. This looks like the consumer-facing Gemini T&Cs.

summerlight

You can use a paid tier to avoid such issues. Not sure what you're expecting from these "experimental" models, which are in development and need user feedback.

sauwan

I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?

suyash

More of a reason for new privacy guidelines, especially for big tech and AI.

mastodon_acc

I mean this is pretty standard for online llms. What is Gemini doing here that openai or Anthropic aren’t already doing?

mindwok

Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.

I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.

weatherlite

360k tokens = how many lines of code, approximately? And also, if it's an open-source lib, are you sure there are no mentions of this bug anywhere on the web?

mindwok

Not a huge library, around 32K LoC and no mention of the bug on the web - I was the first to encounter it (it’s since been fixed) unless the training data is super recent.

weatherlite

Impressive. I tend to think it managed to find the bug by itself, which is pretty crazy without being able to debug anything. Then again, I haven't seen the bug description; perhaps the description makes it super obvious where the problem lies.

kungfufrog

How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!

mindwok

Yep! AI Studio is, I think, the only way you can actually use it right now, and AFAIK it's free.

markdog12

Interesting, I've been asking it to generate some Dart code, and it makes tons of mistakes, including lots of invalid code (static errors). When pointing out the mistakes, it thanks me and tells me it won't make it again, then makes it again on the very next prompt.

blinding-streak

Open the pod bay doors Hal.

I'm sorry Dave, I'm afraid I can't do that.

ripped_britches

Wow holy smokes that is exciting

nmfisher

How long did it take to sift through those?

jnd0

> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

Been playing around with it and it feels intelligent and up to date. Plus, it is connected to the internet. It's a reasoning model by default, when it needs to be.

I hope they enable support for the recently released canvas mode for this model soon it will be a good match.

Workaccount2

It is almost certainly the "nebula" model on LLMarena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.

vineyardmike

I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?

I've always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it's spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.

I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.

forbiddenvoid

I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.

jstummbillig

I think it's reasonable. The development process is just not really comparable to other software engineering: It's fairly clear that currently nobody really has a good grasp on what a model will be while they are being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.

8n4vidtmkvmk

I figured you don't update the major unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained ChatGPT 3 vs GPT 4, other than amount of data. But maybe I'm wrong.

KoolKat23

Funnily enough, from early indications (user feedback) this new model would've been worthy of the 3.0 moniker, despite what the benchmarks say.

aoeusnth1

I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade.

Workaccount2

At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5.

utopcell

It may indicate a Tick-Tock [1] process.

[1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model

alphabetting

The elo jump and big benchmark gains could be justification

falcor84

Agreed, can't everyone just use semantic versioning, with 0.1 increments for regular updates?

laurentlb

Regarding semantic versioning: what would constitute a breaking change?

I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.

falcor84

As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.

morkalork

Or drop the pretext of version numbers entirely since they're meaningless here and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology

joaogui1

Would be confusing for non-tech people once you did x.9 -> x.10

guelo

What would a major version bump look like for an llm?

eru

Going from English to Chinese, I guess? Because that would not be a compatible version for most previous users.

jorl17

Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.

I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).

Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.

The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.

Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.

Breakthrough moment.

Alifatisk

Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.

Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and memorizing older tokens.

nickandbro

Wow, was able to nail the pelican riding on a bicycle test:

https://www.svgviewer.dev/s/FImn7kAo

anon373839

That's actually too good to believe. I have a feeling simonw's favorite test has been special-cased...

Workaccount2

It seems pretty good at it. The hair on the boy is messed up, but still decent.

"A boy eating a sandwhich"

https://www.svgviewer.dev/s/VhcGxnIR

"A multimeter"

https://www.svgviewer.dev/s/N5Dzrmyt

sebzim4500

I doubt it is explicitly special cased, but now that it's all over twitter etc. it will have ended up many times in the training data.

KTibow

They could've RLed on SVGs - wouldn't be hard to render them, test adherence through Gemini or CLIP, and reward fittingly

locallost

What does nail mean? That's not a bicycle.

TonyTrapp

To be honest, it's in good company with real humans there: https://www.behance.net/gallery/35437979/Velocipedia

Maybe it learned from Gianluca's gallery!