
Gemini 2.5

338 comments · March 25, 2025

og_kalu

One of the biggest problems with hands-off LLM writing (for long-horizon stuff like novels) is that you can't really give them any details of your story, because they get absolutely neurotic with it.

Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No amount of telling it otherwise changes anything.

This is the first model where the 19 pages generated so far resemble anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.

Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

with pastebin - https://pastebin.com/aiWuYcrF

comboy

I like how critique of LLMs evolved on this site over the last few years.

We are currently at nonsensical pacing while writing novels.

solardev

It's not really passing the Turing Test until it outsells Harry Potter.

eru

Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not about surpassing humans.

Of course, this is just some pedantry.

I for one love that AI is progressing so quickly that we _can_ move the goalposts like this.

skyechurch

The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.

ruraljuror

We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?

jorl17

I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.

I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.

I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.

ripped_britches

lol wouldn’t that be great to read this comment in 2022

sagarpatil

My prompt is nowhere near yours.

Just for fun: I asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer; the rewrite is hilarious lol.

https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...

nout

Using the AI in multiple phases is an approach that can handle this. Similarly to the "Deep Research" approach, you can tell it to first generate a storyline with multiple twists and turns. Then ask the model to take this storyline and generate prompts for individual chapters. Then ask it to generate the individual chapters based on the prompts, and so on.
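
A minimal sketch of that pipeline, assuming a generic call_llm(prompt) helper rather than any particular API:

    # Phased novel generation: outline -> per-chapter prompts -> chapters.
    # call_llm is a hypothetical stand-in for whatever model client you use.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    def write_novel(premise: str, num_chapters: int = 20) -> list[str]:
        # Phase 1: a full storyline with twists, paced across the whole book.
        outline = call_llm(
            f"Write a {num_chapters}-chapter outline for: {premise}. "
            "Spread key events and character introductions across the whole arc."
        )
        # Phase 2: turn the outline into one self-contained prompt per chapter.
        chapter_prompts = call_llm(
            "From this outline, write one detailed writing prompt per chapter, "
            f"separated by '---':\n{outline}"
        ).split("---")
        # Phase 3: draft each chapter from its prompt, keeping the outline as context.
        return [
            call_llm(f"Outline:\n{outline}\n\nWrite chapter {i + 1} based on:\n{p}")
            for i, p in enumerate(chapter_prompts)
        ]

The point is to push pacing decisions into the outline phase, so no single generation call has to juggle the whole book at once.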

bbor

Yup -- asking a chatbot to create a novel in one shot is very similar to asking a human to improvise a novel in one shot.

mikepurvis

But a future chatbot would be able to internally project manage itself through that process, of first emitting an outline, then producing draft chapters, then going back and critiquing itself and finally rewriting the whole thing.

og_kalu

It's not a problem of one-shotting it. It's that the details cause a collapse. Even if you try breaking it down, which I have, you run into the same problem unless you hold its hand for every single page, and then what's the point? I want to read the story, not co-author it.

fshr

I think you would be better off having the LLM help you build up the plot with high level chapter descriptions and then have it dig into each chapter or arc. Or start by giving it the beats before you ask it for help with specifics. That'd be better at keeping it on rails.

og_kalu

I don't disagree. As with almost anything else involving LLMs, getting hands-on produces better results. But because in this instance I much prefer to be the reader rather than the author or editor, it's really important to me that an LLM is capable of pacing long-form writing properly on its own.

saberience

Random question: if you don't care about being a creator yourself, why do you even want to read long-form writing written by an LLM? There are literally tens of thousands of actual human-written books out there, all of them better than anything an LLM can write; why not read those?

null

[deleted]

stephen_cagle

Opening with "like a struck flint carried on a wind that wasn’t blowing." <chuckles>

I don't know why, but that is just such a literal thing to say that it seems almost random.

solardev

19 pages?! Am I the only one who prefers an AI that jumps straight to the point?

- Buildup and happy background world-building

- Subtle foreshadowing

- Orcs attack

- Hero is saved by unlikely warrior of astounding beauty

- Evil is defeated until sales justify unnecessary sequel

That's the kind of story fit for the modern attention span...

Workaccount2

Can you share it on a text sharing site? It seems you hit your share quota

Medicineguy

While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both the web app and the mobile app, are currently quite a bit more polished. For example, Gemini's speech recognition struggles with longer pauses and often enough cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying “[...] plex, emby and Jellyfin [...]” is usually understood by Whisper, but less often by Gemini). The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy-friendly pair programming). The last point off the top of my head would be the ability to edit messages beyond just the last one; that's possible in ChatGPT, but not in Gemini. Google is spending so much money on model training; I would appreciate it spending some on making the product fun to use :)

malisper

I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.

Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

For those curious the riddle is: There are three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
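
A brute-force sketch of the reasoning, under the usual reading that "I don't know" means both candidate values for the speaker's own number are still consistent with every earlier statement (the search bound below is just an assumption):

    # Each person's number is either the sum or the (positive) difference of
    # the two numbers they can see; "I don't know" means both candidates
    # survive all earlier announcements.
    def candidates(x, y):
        c = {x + y}
        if x != y:               # the difference must stay a positive integer
            c.add(abs(x - y))
        return c

    def p1_knows(b, c):
        return len(candidates(b, c)) == 1

    def p2_knows(a, c):
        # keep only b values for which P1 would still have said "I don't know"
        return len([b for b in candidates(a, c) if not p1_knows(b, c)]) == 1

    def p3_knows(a, b):
        return len([c for c in candidates(a, b)
                    if not p1_knows(b, c) and not p2_knows(a, c)]) == 1

    def p1_second(b, c):
        # P1's number, if the three "don't know"s now pin it down uniquely
        viable = [a for a in candidates(b, c)
                  if not p2_knows(a, c) and not p3_knows(a, b)]
        return viable[0] if len(viable) == 1 else None

    N = 200  # assumed search bound
    for b in range(1, N):
        for c in range(1, N):
            for a in {b + c, abs(b - c)} - {0}:
                if p1_knows(b, c) or p2_knows(a, c) or p3_knows(a, b):
                    continue     # contradicts one of the three "I don't know"s
                if p1_second(b, c) == 65:
                    print(a, b, c, "product =", a * b * c)

Over this bound it turns up 65, 26 and 39, which matches the 65910 product quoted elsewhere in the thread.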

hmottestad

This looks like it was posted on Reddit 10 years ago:

https://www.reddit.com/r/math/comments/32m611/logic_question...

So it’s likely that it’s part of the training data by now.

canucker2016

You'd think so, but both Google's AI Overview and Bing's Copilot output wrong answers.

Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

Whoa. Math is not AI's strong suit...

Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

Hats???

Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.

moritzwarhier

That's a non-sequitur; they would be stupid to run an expensive LLM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.

vicek22

The riddle has a different variant with hats: https://erdos.sdslabs.co/problems/5

f1shy

Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that LLMs can reason. They just regurgitate the input.

motoxpro

I would say that 99.99% of humans do the same. Most people never come up with anything novel.

malisper

Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.

lolinder

I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.

There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:

* Random chance (these are still statistical machines after all)

* The problem resurfaced recently and shows up more often than it used to.

* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.

mattkevan

I think there’s a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.

Recently these have stopped, and now the ads are about becoming a maths tutor for AI.

Doesn’t seem like a role with long-term prospects.

stabbles

It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.

7e

Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving it (the "answer") is online, and the model has surely seen it.

drexlspivey

And if it wasn’t, it is now

thaumasiotes

> So it’s likely that it’s part of the training data by now.

I don't think this means what you think it means.

I did some interacting with the Tencent model that showed up here a couple days ago ( https://news.ycombinator.com/item?id=43447254 ):

> Do you know the British band Steeleye Span?

>> Yes, I know the legendary British folk-rock band Steeleye Span! They were one of the key figures in the British folk revival of the 1970s, famous for a distinctive style blending traditional English folk with rock instrumentation (electric guitar, drums). Core members include Maddy Prior (the band's signature voice) and Tim Hart (a multi-instrumentalist); Nigel Pegrum (a legendary drummer), who joined later, also added weight to their sound.

>> Their classic works include:

>> - "Gaudete": a striking arrangement of a medieval Latin carol, which reached No. 14 on the UK singles chart;

>> - "All Around My Hat": the title track of the album of the same name, a folk-rock benchmark;

>> - "Cam Ye O'er Frae France": a playful take on British history.

>> The band's lineup has changed frequently, but they have always kept creatively reworking traditional music. Maddy Prior is still active at festivals in recent years; if you want to feel their wild stage energy, start with their golden-era work from the 1970s! Would you like specific playlist recommendations or background stories? [happy blushing emoji]

> I like their song The Victory. How would you describe it?

>> The "The Victory" you mention may refer to a Steeleye Span piece, but there is no record of a song by that name on their official albums.

>> [further suggestions omitted]

> Huh? Which songs are on Steeleye Span's Storm Force Ten?

>> The "Storm Force Ten" you mention is Steeleye Span's tenth studio album, released in 1975. The album continues the band's blend of traditional English folk and hard rock, while also folding in Celtic, medieval and narrative elements; overall it is full of energy and dramatic tension. Here is the complete track list and a breakdown of the highlights:

>> Track list

>> 1. "Roll Right Stones" - Inspired by the legend of the stone circle; the opening flute and electric guitar weave a mysterious atmosphere, and Maddy Prior's voice is like an epic recitation.

>> 2. "Let Her Go Down" - A folk foundation with a rock beat; the lyrics allude to seafaring adventure and struggling against fate, with drumming as dense as surging waves.

>> 3. "Hark! The Heron Voiced" - A medieval-style narrative song depicting a dialogue between nature and spirit, with a strong sense of conversation between flute and fiddle.

>> 4. "The Brisk Young Widow" - An upbeat dance rhythm telling a comic tale of a widow and a rake, showing the band's humorous side.

>> 5. "The Storm Force Ten" - The title track, using a storm as a metaphor for romantic conflict; the guitar solo tears through like lightning and the climax is explosive.

>> 6. "The Green Knight" - Drawn from Arthurian legend, a theme of knight against nature; organ and guitar tones conjure the air of an ancient battlefield.

>> 7. "The Blacksmith" - Adapted from a traditional folk song; the heavy rock arrangement turns the blacksmith's hammer blows into the rhythmic bedrock, and Maddy's improvised high notes are stunning.

>> 8. "Cam Ye O'er Frae France" - A political satire ballad poking fun at British royal history, with dueling twin guitars like clashing swords.

>> 9. "The Maid of Llanwellyn" - A Celtic-style lyrical piece with plaintive string arrangements, showing the band's tender side.

>> 10. "The Whisky Jacobites" - A celebratory number toasting the Scottish spirit of independence; the collision of accordion and rock instruments is full of festive atmosphere.

>> Album highlights

>> - Immersion in history and legend: from King Arthur to the Scottish uprisings, the lyrics read like a miniature epic.

>> - Experimental timbres: organ, accordion and electric guitar mix into a unique "folk heavy metal" texture.

>> - Dynamic contrast: storm-fast rock passages alternate with quiet folk storytelling, full of tension.

>> If you prefer the grandeur of "The Victory", I recommend starting with "The Storm Force Ten" and "The Green Knight"; the epic arrangements of those two rank among the band's peak work! Would you like listening links or a lyrics breakdown? [sunglasses emoji]

This is a question that obviously was in the training data. How do you get the answer back out of the training data?

_cs2017_

This is solvable in roughly half an hour with pen and paper by a random person I picked with no special math skills (beyond a university education). This is far from a difficult problem. The "95%+" in math reasoning is a meaningless standard; it's like saying a model is better than 99.9% of the world population at the Albanian language, since less than 0.1% bother to learn Albanian.

Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.

This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's far, far harder than this math puzzle).

windowshopping

The "random person" you picked is likely very, very intelligent and not at all a good random sample. I'm not saying this is difficult to the extent that it merits academic focus, but it is NOT a simple problem and I suspect less than 1% of the population could solve this in half an hour "with no special math skills." You have to be either exceedingly clever or trained in a certain type of reasoning or both.

sebzim4500

This is a great riddle. Unfortunately, I was easily able to find the exact question with a solution (albeit with a different number) online, thus it will have been in the training set.

Workaccount2

What makes this interesting is that while the question is online (on Reddit, from 10 years ago), other models don't get the answer right. Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on Reddit.

Granted, this is all way over my head, but the solution Gemini comes to matches the one given on Reddit (and now here, in future training runs):

65×26×39=65910

sebzim4500

> Gemini also shows its work, and it seems to do a few orders of magnitude more calculation than the elegant answer given on Reddit.

I don't think Gemini does an unnecessary amount of computation; it's just more verbose. This is typical of reasoning models: almost every step is necessary, but many would not be written down by a human.

null

[deleted]

varispeed

Seems like we might need a section of the internet that is off-limits to robots.

Centigonal

Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.

kylebenzle

Or we could just accept that LLMs can only output what we have put in, and that calling them "AI" was a misnomer from day one.

baq

It’s here and it’s called Discord.

yifanl

You'd have better results if you had prompted it with the actual answer and asked how the first person came to the conclusion. Giving back a number that's in the training set is very easy.

i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26, and the third person is standing underneath a 39. They can see the others' numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number; they respond that they don't know. You ask the second person for their number; they respond that they don't know. You ask the third person; they respond that they don't know. You ask the first person again and they respond with the correct value. How did they know?

And of course, if it responds with a verbatim answer along the lines of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.

null

[deleted]

mitko

Loved that puzzle, thanks for sharing it. I’ve solved a lot of math problems in the past but this one had a unique flavor of interleaving logical reasoning, partial information and a little bit of arithmetic.

semiinfinitely

I love how the entire comment section is getting one-shotted by your math riddle instead of the original post topic.

utopcell

Same here: my problem of choice is the 100 prisoners problem [1]. I used to ask simple reasoning questions in the style of "what is the day three days before the day after tomorrow", but nowadays when I ask such questions, I can almost feel the NN giggling at the naivety of its human operator.

[1] https://en.wikipedia.org/wiki/100_prisoners_problem

simonw

I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.

Plus it drew me a very decent pelican riding a bicycle.

Notes here: https://simonwillison.net/2025/Mar/25/gemini/

redox99

Claude's pelican is way better than Gemini's

kridsdale3

I just tried your trademark benchmark on the new 4o Image Output, though it's not the same test:

https://imgur.com/a/xuPn8Yq

Unroasted6154

Was that an actual SVG?

simonw

No, that's GPT-4o's native image output.

freediver

Tops our benchmark in an unprecedented way.

https://help.kagi.com/kagi/ai/llm-benchmark.html

High quality, to the point. Bit on the slow side. Indeed a very strong model.

Google is back in the game big time.

aoeusnth1

It should be in the "reasoning" category, right? (still topping the charts there)

utopcell

That is some wide gap!

anotherpaulg

Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.

This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.

[0] https://aider.chat/docs/leaderboards/

sagarpatil

The only benchmark I care about. Thanks!

aoeusnth1

Am I correct in assuming that accuracy < using correct edit format? i.e. it made mistakes in 27% of the problems, 11% of which were due to (at least) messing up the diff format?

In which case, Google should be working on achieving better output-format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.

anotherpaulg

It does have fairly low adherence to the edit format, compared to the other frontier models. But it is much better than any previous Gemini model in this regard.

Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.

aoeusnth1

Ok, thanks for clearing that up.

Oras

These announcements have started to look like a template.

- Our state-of-the-art model.

- Benchmarks comparing to X,Y,Z.

- "Better" reasoning.

It might be an excellent model, but reading the exact same text repeatedly is taking the excitement away.

devsda

Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for some time now have been just incremental updates over the previous gen and only marginally better. Couple that with Chinese OEMs launching better or good-enough devices at a lower price point, and new launches from established players are not noteworthy anymore.

It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.

breppp

I think the greatest issue with buying a new phone today is ironically the seamless migration.

Once you get all your apps, wallpaper, shortcut order and the same OS back, you very quickly get the feeling you spent $1000 on the exact same thing.

atonse

100% agree with you.

But it needs to be seamless to remove any friction from the purchase; at the same time, if it feels the same, then it feels like we wasted money.

So what I usually do is buy a different colored phone and change the wallpaper.

My MacBook was the same. Seamless transition, and 2 hours later I was used to the new M4 speeds.

flakiness

Phones are limited by hardware manufacturing, plus maybe the annual shopping cycle peaking at Christmas. People wouldn't have bought multiple iPhones even in the product's heyday.

These LLM models were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to be taking less time.

If models were tied to specific hardware (say, an "AI PC" or whatever), the cycle would get slower and we'd get a slower summer, which I'm secretly wishing for.

tibbar

For me, the most exciting part is the improved long-context performance. A lot of enterprise/RAG applications rely on synthesizing a bunch of possibly relevant data. Let's just say it's clearly a bottleneck in current models and I would expect to see a meaningful % improvement in various internal applications if long-context reasoning is up. Gemini was already one of my favorite models for this usecase.

So, I think these results are very interesting, if you know what features specifically you are using.

zwaps

But they score it on their own benchmark, on which, coincidentally, Gemini models were always the only good ones. In Nolima or Babilong we see that Gemini models still can't do long context.

Excited to see if it works this time.

bhouston

> It might be an excellent model, but reading the exact same text repeatedly is taking the excitement away.

This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.

They are all interchangeable. This is great for users as it adds to price pressure.

flir

Man, I hope those benchmarks actually measure something.

Legend2440

I would say they are a fairly good measure of how well the model has integrated information from pretraining.

They are not so good at measuring reasoning, out-of-domain performance, or creativity.

Workaccount2

Sooner or later someone is going to find "secret sauce" that provides a step-up in capability, and it will be closely guarded by whoever finds it.

As big players look to start monetizing, they are going to desperately be searching for moats.

bangaladore

Reasoning was supposed to be that for "Open" AI; that's why they go to such lengths to hide the reasoning output. Look how that turned out.

Right now, in my opinion, OpenAI actually has a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.

cratermoon

Sooner or later someone is going to find the "secret sauce" that allows building a stepladder tall enough to reach the moon.

It's called the "first step fallacy", and AI hype believers continue to fall for it.

cadamsdotcom

Why not snooze the news for a year and see what's been invented when you get back? That'll blow your mind properly. Because each of these incremental announcements contributes to a mind-blowing rate of improvement.

The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.

The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.

sagarpatil

AI labs, it seems, use a template for system cards as well. OpenAI stands out because they showcase their employees using their tools for various use cases, which is refreshing.

diego_sandoval

I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence.

If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.

gtirloni

The improvements have been marginal at best. I wouldn't call that beating.

ototot

Maybe they just asked Gemini 2.5 to write the announcement.

cpeterso

And it was trained on the previous announcements.

xlbuttplug2

... which were also written by earlier Gemini versions.

schainks

I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.

Also, I think Google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.

throwaway2037

    > I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI, nor Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".

mindwok

Just adding to the praise: I have a little test case I've used lately, which is to identify the cause of a bug I was encountering in a Dart library by providing the LLM with the entire codebase and a description of the bug. It's about 360,000 tokens.

I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
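
For anyone wanting to try something similar, a minimal sketch of packing a codebase into one long-context prompt (the path, file extensions and character cap here are assumptions, not what I actually used):

    # Concatenate a repo's source files into a single prompt, one section per
    # file, so a long-context model can reason over the whole codebase at once.
    from pathlib import Path

    REPO = Path("path/to/dart_library")   # hypothetical location
    EXTS = {".dart", ".yaml", ".md"}      # assumed file types worth including
    MAX_CHARS = 1_500_000                 # rough cap, a few hundred thousand tokens

    parts = []
    for f in sorted(REPO.rglob("*")):
        if f.is_file() and f.suffix in EXTS:
            parts.append(f"===== {f.relative_to(REPO)} =====\n{f.read_text(errors='ignore')}")

    context = "\n\n".join(parts)[:MAX_CHARS]
    prompt = (
        "Here is an entire library's source code, one file per section.\n\n"
        f"{context}\n\n"
        "Bug description: <describe the observed behaviour here>.\n"
        "Identify the most likely cause and the file to fix."
    )
    print(f"{len(context):,} characters of context prepared")

Whether the model can actually use all of that context well is exactly what this test probes.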

kungfufrog

How do you use the model so quickly? Google AI Studio? Maybe I've missed how powerful that is.. I didn't see any easy way to pass it a whole code base!

mindwok

Yep! AI Studio, I think, is the only way you can actually use it right now, and AFAIK it's free.

ripped_britches

Wow holy smokes that is exciting

jnd0

> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet. It's a reasoning model by default when it needs to be.

I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.

Workaccount2

It is almost certainly the "nebula" model on LLMarena that has been generating buzz for the last few days. I didn't test coding, but its reasoning is very strong.

greatgib

If you plan to use Gemini, be warned, here are the usual Big Tech dragons:

   Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:

   How human reviewers improve Google AI

   To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.

cavisne

Google is the best of these. You either pay per token and there is no training on your inputs, or it’s free/a small monthly fee and there is training.

greatgib

And even worse:

   Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!

kccqzy

Well they can't delete a user's Gemini conversations because they don't know which user a particular conversation comes from.

This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.

mastodon_acc

How does it compare to OpenAI's and Anthropic's user data retention policies?

greatgib

If I'm not wrong, ChatGPT clearly states that they don't use user data by default anymore.

Also, maybe some services are doing "machine learning" training with user data, but it is the first time I've seen a recent LLM service say that it can feed your data to human reviewers at will.

KoolKat23

I don't think this is the same as the AI Studio and API terms. This looks like the consumer-facing Gemini T&Cs.

summerlight

You can use a paid tier to avoid such issues. Not sure what you're expecting from these "experimental" models, which are in development and need user feedback.

sauwan

I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?

suyash

All the more reason for new privacy guidelines, especially for big tech and AI.

mastodon_acc

I mean, this is pretty standard for online LLMs. What is Gemini doing here that OpenAI or Anthropic aren't already doing?

vineyardmike

I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?

I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it's spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.

I’m a much bigger fan of semver (not skipping to .5 though), date-based ("Gemini Pro 2025"), or number + meaningful letter (e.g. 4o, "Omni") for model names.

forbiddenvoid

I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.

jstummbillig

I think it's reasonable. The development process is just not really comparable to other software engineering: it's fairly clear that currently nobody really has a good grasp on what a model will be while it is being trained. But they do have expectations. So you do the training, and then you assign the increment to align the two.

8n4vidtmkvmk

I figured you don't update the major version unless you significantly change the... algorithm, for lack of a better word. At least I assume something major changed between how they trained GPT-3 vs GPT-4, other than the amount of data. But maybe I'm wrong.

KoolKat23

Funnily enough, from early indications (user feedback) this new model would've been worthy of the 3.0 moniker, despite what the benchmarks say.

aoeusnth1

I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade.

utopcell

It may indicate a Tick-Tock [1] process.

[1] https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model

Workaccount2

At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5.

alphabetting

The Elo jump and big benchmark gains could be justification.

falcor84

Agreed, can't everyone just use semantic versioning, with 0.1 increments for regular updates?

laurentlb

Regarding semantic versioning: what would constitute a breaking change?

I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.

falcor84

As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and there might be some tradeoffs (e.g. longer runtime), it should be a major change. Or if it is very significantly different, then it should be considered an entirely differently named model.

joaogui1

Would be confusing for non-tech people once you did x.9 -> x.10

morkalork

Or drop the pretense of version numbers entirely, since they're meaningless here, and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology.

guelo

What would a major version bump look like for an LLM?

jorl17

Just a couple of days ago I wrote on Reddit about how long-context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.

I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).

Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.

The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.

Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident I could offload the task of "reasoning" over a large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.

Breakthrough moment.

Alifatisk

Two years ago, Claude was known for having the largest context window and being able to remember tokens throughout the whole conversation.

Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and remembering older tokens.