Gemini 2.5 Pro Preview

727 comments

·May 6, 2025

segphault

My frustration with using these models for programming in the past has largely been around their tendency to hallucinate APIs that simply don't exist. The Gemini 2.5 models, both pro and flash, seem significantly less susceptible to this than any other model I've tried.

There are still significant limitations, no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models are finally able to replace searches and stackoverflow for a lot of my day-to-day programming.

jstummbillig

> no amount of prompting will get current models to approach abstraction and architecture the way a person does

I find this sentiment increasingly worrisome. It's entirely clear that every last human will be beaten on code design in the upcoming years (I am not going to argue if it's 1 or 5 years away, who cares?)

I wish people would just stop holding on to what amounts to nothing, and think and talk more about what can be done in a new world. We need good ideas and I think this could be a place to advance them.

ssalazar

I code with multiple LLMs every day and build products that use LLM tech under the hood. I don't think we're anywhere near LLMs being good at code design. Existing models make _tons_ of basic mistakes and require supervision even for relatively simple coding tasks in popular languages, and it's worse for languages and frameworks that are less represented in public sources of training data. I am _frequently_ having to tell Claude/ChatGPT to clean up basic architectural and design defects. There's no way I would trust this unsupervised.

Can you point to _any_ evidence to support that human software development abilities will be eclipsed by LLMs other than trying to predict which part of the S-curve we're on?

xyzzy123

I can't point to any evidence. Also I can't think of what direct evidence I could present that would be convincing, short of an actual demonstration? I would like to try to justify my intuition though:

Seems like the key question is: should we expect AI programming performance to scale well as more compute and specialised training is thrown at it? I don't see why not, it seems an almost ideal problem domain?

* Short and direct feedback loops

* Relatively easy to "ground" the LLM by running code

* Self-play / RL should be possible (it seems likely that you could also optimise for aesthetics of solutions based on common human preferences)

* Obvious economic value (based on the multi-billion dollar valuations of vscode forks)

All these things point to programming being "solved" much sooner than say, chemistry.

cheema33

> I code with multiple LLMs every day and build products that use LLM tech under the hood. I dont think we're anywhere near LLMs being good at code design.

I too use multiple LLMs every day to help with my development work. And I agree with this statement. But, I also recognize that just when we think that LLMs are hitting a ceiling, they turn around and surprise us. A lot of progress is being made on the LLMs, but also on tools like code editors. A very large number of very smart people are focused on this front and a lot of resources are being directed here.

If the question is:

Will the LLMs get good at code design in 5 years?

I think the answer is:

Very likely.

I think we will still need software devs, but not as many as we do today.

thecupisblue

You're using them in reverse. They are perfect for generating code according to your architectural and code design template. Relying on them for architectural design is like picking your nose with a pair of scissors - yeah technically doable, but one slip and it all goes to hell.

lallysingh

The software tool takes a higher-level input to produce the executable.

I'm waiting for LLMs to integrate directly into programming languages.

The discussions sound a bit like the early days of when compilers started coming out, and people had been using direct assembler before. And then decades after, when people complained about compiler bugs and poor optimizers.

Kinrany

We're talking about predicting the future, so we can only extrapolate.

Seeing the evidence you're thinking of would mean that LLMs will have solved software development by next month.

ArthurStacks

I run a software development company with dozens of staff across multiple countries. Gemini has gotten us to the point where we can actually stop hiring for certain roles, and staff have been informed they must make use of these tools or they are surplus to requirements. At the current rate of improvement I believe we will be operating with far fewer staff in 2 years' time.

fragmede

https://chatgpt.com/c/681aa95f-fa80-8009-84db-79febce49562

it becomes a question of how much you believe it's all just training data, and how much you believe the LLM's got pieces that are composable. I've given the question in the link as an interview question and had humans be unable to give as thorough an answer (which I choose to believe is due to specialization elsewhere in the stack). So we're already at a place where some human software development abilities have been eclipsed on some questions. So even if the underlying algorithms don't improve and they just ingest more training data, it doesn't seem like a total guess as to what part of the S-curve we're on - the number of software development questions that LLMs are able to successfully answer will continue to increase.

DanHulton

> It's entirely clear that every last human will be beaten on code design in the upcoming years

Citation needed. In fact, I think this pretty clearly hits the "extraordinary claims require extraordinary evidence" bar.

aposm

I had a coworker making very similar claims recently - one of the more AI-positive engineers on my team (a big part of my department's job is assessing new/novel tech for real-world value vs just hype). I was stunned when I actually saw the output of this process, which was a multi-page report describing the architecture of an internal system that arguably needed an overhaul. I try to keep an open mind, but this report was full of factual mistakes, misunderstandings, and when it did manage to accurately describe aspects of this system's design/architecture, it made only the most surface-level comments about boilerplate code and common idioms, without displaying any understanding of the actual architecture or implications of the decisions being made. Not only this coworker but several other more junior engineers on my team proclaimed this to be an example of the amazing advancement of AI ... which made me realize that the people claiming that LLMs have some superhuman ability to understand and design computer systems are those who have never really understood it themselves. In many cases these are people who have built their careers on copying and pasting code snippets from stack overflow, etc., and now find LLMs impressive because they're a quicker and easier way to do the same.

sweezyjeezy

I would argue that what LLMs are capable of doing right now is already pretty extraordinary, and would fulfil your extraordinary evidence request. To turn it on its head - given the rather astonishing success of the recent LLM training approaches, what evidence do you have that these models are going to plateau short of your own abilities?

ArthurStacks

Beating humans isn't really what matters. It's enabling developers who can't design to do so.

Last month I had a staff member design and build a distributed system that would be far beyond their capabilities without AI assistance. As a business owner this allows me to reduce the dependency and power of the senior devs.

numpad0

We were all crazy hyped when NVIDIA demoed end-to-end self driving, weren't we? The first-order derivative of a hype cycle curve at low X values is always extremely large, but that's not so useful. At large X it's obviously obvious. It's always been that way.

kaliqt

Trends would dictate that this will keep scaling and surpass each goalpost year by year.

mark_l_watson

I recently asked o4-mini-high for a system design of something moderately complicated and provided only about 4 paragraphs of prompt for what I wanted. I thought the design was very good, as was the Common Lisp code it wrote when I asked it to implement the design; one caveat though: it did a much better job implementing the design in Python than Common Lisp (where I had to correct the generated code).

My friend, we are living in a world of exponential increase of AI capability, at least for the last few years - who knows what the future will bring!

pinoy420

[dead]

coffeemug

AlphaGo.

sirstoke

I’ve been thinking about the SWE employment conundrum in a post-LLM world for a while now, and since my livelihood (and that of my loved ones’) depends on it, I’m obviously biased. Still, I would like to understand where my logic is flawed, if it is. (I.e I’m trying to argue in good faith here)

Isn’t software engineering a lot more than just writing code? And I mean like, A LOT more?

Informing product roadmaps, balancing tradeoffs, understanding relationships between teams, prioritizing between separate tasks, pushing back on tech debt, responding to incidents, it’s a feature and not a bug, …

I’m not saying LLMs will never be able to do this (who knows?), but I’m pretty sure SWEs won’t be the only role affected (or even the most affected) if it comes to this point.

Where am I wrong?

MR4D

I think an analogy that is helpful is that of a woodworker. Automation just allowed them to do more things in less time.

Power saws really reduced time, lathes even more so. Power drills changed drilling immensely, and even nail guns are used on roofing projects because manual is way too slow.

All the jobs still exist, but their tools are way more capable.

naasking

> Informing product roadmaps, balancing tradeoffs, understanding relationships between teams, prioritizing between separate tasks, pushing back on tech debt, responding to incidents, it’s a feature and not a bug, …

Ask yourself how many of these things still matter if you can tell an AI to tweak something and it can rewrite your entire codebase in a few minutes. Why would you have to prioritize, just tell the AI everything you have to change and it will do it all at once. Why would you have tech debt, that's something that accumulates because humans can only make changes on a limited scope at a mostly fixed rate. LLMs can already respond to feedback about bugs, features and incidents, and can even take advice on balancing tradeoffs.

Many of the things you describe are organizational principles designed to compensate for human limitations.

dgroshev

Software engineering (and most professions) also have something that LLMs can't have: an ability to genuinely feel bad. I think [1] it's hugely important and is an irreducible advantage that most engineering-adjacent people ignore for mostly cultural reasons.

[1]: https://dgroshev.com/blog/feel-bad/

concats

The way I see it:

* The world is increasingly run on computers.

* Software/Computer Engineers are the only people who actually truly know how computers work.

Thus it seems to me highly unlikely that we won't have a job.

What that job entails I do not know. Programming like we do today might not be something that we spend a considerable amount of time doing in the future. Just like most people today don't spend much time handing punched-cards or replacing vacuum tubes. But there will still be other work to do, I don't doubt that.

acedTrex

> It's entirely clear that every last human will be beaten on code design in the upcoming years

In what world is this statement remotely true.

dullcrisp

In the world where idle speculation can be passed off as established future facts, i.e., this one I guess.

1024core

Proof by negation, I guess?

If someone were to claim: no computer will ever be able to beat humans in code design, would you agree with that? If the answer is "no", then there's your proof.

askl

In the delusional startup world.

mattgreenrocks

I'm always impressed by the ability of the comment section to come up with more reasons why decent design and architecture of source code just can't happen:

* "it's too hard!"

* "my coworkers will just ruin it"

* "startups need to pursue PMF, not architecture"

* "good design doesn't get you promoted"

And now we have "AI will do it better soon."

None of those are entirely wrong. They're not entirely correct, either.

astrange

> * "my coworkers will just ruin it"

This turns out to be a big issue. I read everything about software design I could get my hands on for years, but then at an actual large company it turned out not to help, because I'd never read anything about how to get others to follow the advice in my head from all that reading.

dullcrisp

It’s always so aggressive too. What fools we are for trying to write maintainable code when it’s so obviously impossible.

davidsainez

I use LLMs for coding every day. There have been significant improvements over the years but mostly across a single dimension: mapping human language to code. This capability is robust, but you still have to know how to manage context to keep them focused. I still have to direct them to consider e.g. performance or architecture considerations.

I'm not convinced that they can reason effectively (see the ARC-AGI-2 benchmarks). Doesn't mean that they are not useful, but they have their limitations. I suspect we still need to discover tech distinct from LLMs to get closer to what a human brain does.

jjice

I'm confused by your comment. It seems like you didn't really provide a retort to the parent's comment about bad architecture and abstraction from LLMs.

FWIW, I think you're probably right that we need to adapt, but there was no explanation as to _why_ you believe that that's the case.

TuringNYC

I think they are pointing out that the advantage humans have has been chipped away little by little and computers winning at coding is inevitable on some timeline. They are also suggesting that perhaps the GP is being defensive.

concats

I won't deny that in a context with perfect information, a future LLM will most likely produce flawless code. I too believe that is inevitable.

However, in real life work situations, that 'perfect information' prerequisite will be a big hurdle I think. Design can depend on any number of vague agreements and lots of domain specific knowledge, things a senior software architect has only learnt because they've been at the company for a long time. It will be very hard for a LLM to take all the correct decisions without that knowledge.

Sure, if you write down a summary of each and every meeting you've attended for the past 12 months, as well as attach your entire company confluence, into the prompt, perhaps then the LLM can design the right architecture. But is that realistic?

More likely I think the human will do the initial design and specification documents, with the aforementioned things in mind, and then the LLM can do the rest of the coding.

Not because it would have been technically impossible for the LLM to do the code design, but because it would have been practically impossible to craft the correct prompt that would have given the desired result from a blank sheet.

mbil

I agree that there’s a lot of ambiguity and tacit information that goes into building code. I wonder if that won’t change directly as a result of wanting to get more value out of agentic AI coders.

> Sure, if you write down a summary of each and every meeting you've attended for the past 12 months, as well as attach your entire company confluence, into the prompt, perhaps then the LLM can design the right architecture. But is that realistic?

I think it is definitely realistic. Zoom and Confluence already have AI integrations. To me it doesn’t seem long before these tools and more become more deeply MCPified, with their data and interfaces made available to the next generation of AI coders. “I’m going to implement function X with this specific approach based on your conversation with Bob last week.”

It strikes me that remote first companies may be at an advantage here as they’re already likely to have written artifacts of decisions and conversations, which can then provide more context to AI assistants.

null

[deleted]

Jordan-117

I recently needed to recommend some IAM permissions for an assistant on a hobby project; not complete access but just enough to do what was required. Was rusty with the console and didn't have direct access to it at the time, but figured it was a solid use case for LLMs since AWS is so ubiquitous and well-documented. I actually queried 4o, 3.7 Sonnet, and Gemini 2.5 for recommendations, stripped the list of duplicates, then passed the result to Gemini to vet and format as JSON. The result was perfectly formatted... and still contained a bunch of non-existent permissions. My first time being burned by a hallucination IRL, but just goes to show that even the latest models working in concert on a very well-defined problem space can screw up.

darepublic

Listen, I don't blame any mortal being for not grokking the AWS and Google docs. They are a twisting labyrinth of pointers to pointers, some of them deprecated though still recommended by Google itself.

perching_aix

Sounds like a vague requirement, so I'd just generally point you towards the AWS managed policies summary [0] instead. Particularly the PowerUserAccess policy sounds fitting here [1] if the description for it doesn't raise any immediate flags. Alternatively, you could browse through the job function oriented policies [2] they have and see if you find a better fit. Can just click it together instead of bothering with the JSON. Though it sounds like you're past this problem by now.

[0] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_poli...

[1] https://docs.aws.amazon.com/aws-managed-policy/latest/refere...

[2] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_poli...

floydnoel

by asking three different models and then keeping every single unique thing they gave you, I believe you actually maximized your chances of running into hallucinations.

instead of ignoring the duplicates, when I query different models, I use the duplicates as a signal that something might be more accurate. I wonder what your results might have looked like if you only kept the duplicated permissions and went from there.
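Roughly what I mean, sketched in Python (the model names and permission strings here are made-up placeholders): treat the intersection of the models' answers as the shortlist and flag everything else for manual review.

```python
# Hypothetical suggestion sets from three models; the odd ones out are likely hallucinations.
suggestions = {
    "gpt-4o":     {"s3:GetObject", "s3:ListBucket", "s3:ReadBucketContents"},
    "claude-3.7": {"s3:GetObject", "s3:ListBucket"},
    "gemini-2.5": {"s3:GetObject", "s3:ListBucket", "s3:PutObject"},
}

consensus = set.intersection(*suggestions.values())       # everything all models agree on
outliers = set.union(*suggestions.values()) - consensus   # union-only items, review by hand

print("likely safe:", sorted(consensus))
print("needs checking:", sorted(outliers))
```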

dotancohen

AWS docs have (had) an embedded AI model that would do this perfectly. I suppose it had better training data, and the actual spec as a RAG.

djhn

Both AWS and Azure docs’ built in models have been absolutely useless.

mark_l_watson

I have a suggestion for you: Create a Gemini Gem for a programming language and put context info for library resources, examples of your coding style, etc.

I just dropped version 0.1 of my Gemini book, and I have an example for making a Gem (really simple to do); read online link:

https://leanpub.com/solo-ai/read

siscia

This problem has been solved by LSP (language server protocol). All we need is a small server behind MCP that can communicate LSP information back to the LLM, and to get the LLM to use it by adding something like "check your API usage with the LSP" to the prompt.

The unfortunate state of open source funding makes building such a simple tool a losing venture, unfortunately.
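A rough sketch of the loop I mean, with both helpers as hypothetical stubs rather than a real MCP or LSP API:

```python
def get_lsp_diagnostics(path: str, code: str) -> list[str]:
    """Placeholder: a real version would ask a language server for errors in `code`."""
    return []

def ask_llm(prompt: str) -> str:
    """Placeholder: a real version would call whatever model is generating the code."""
    raise NotImplementedError

def repair_edit(path: str, proposed_code: str, max_rounds: int = 3) -> str:
    """Feed LSP diagnostics back to the model until the edit is clean or we give up."""
    code = proposed_code
    for _ in range(max_rounds):
        diagnostics = get_lsp_diagnostics(path, code)
        if not diagnostics:
            break
        code = ask_llm(
            f"Your edit to {path} produced these diagnostics:\n"
            + "\n".join(diagnostics)
            + "\nReturn the corrected file only."
        )
    return code
```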

satvikpendem

This already happens in agent modes in IDEs like Cursor or VSCode with Copilot, it can check for errors with the LSP.

doug_durham

If they never get good at abstraction or architecture they will still provide a tremendous amount of value. I have them do the parts of my job that I don't like. I like doing abstraction and architecture.

mynameisvlad

Sure, but that's not the problem people have with them nor the general criticism. It's that people without the knowledge to do abstraction and architecture don't realize the importance of these things and pretend that "vibe coding" is a reasonable alternative to a well-thought-out project.

Karrot_Kream

We can rewind the clock 10 years and I can substitute "vibe coding" for VBA/Excel macros and we'd get a common type of post from back then.

There's always been a demand for programming by non technical stakeholders that they try and solve without bringing on real programmers. No matter the tool, I think the problem is evergreen.

sanderjd

The way I see this is that it's just another skill differentiator that you can take advantage of if you can get it right.

That is, if it's true that abstraction and architecture are useful for a given product, then people who know how to do those things will succeed in creating that product, and those who don't will fail. I think this is true for essentially all production software, but a lot of software never reaches production.

Transitioning or entirely recreating "vibecoded" proofs of concept to production software is another skill that will be valuable.

Having a good sense for when to do that transition, or when to start building production software from the start, and especially the ability to influence decision makers to agree with you, is another valuable skill.

I do worry about what the careers of entry level people will look like. It isn't obvious to me how they'll naturally develop any of these skills.

codebolt

I've found they do a decent job searching for bugs now as well. Just yesterday I had a bug report on a component/page I wasn't familiar with in our Angular app. I simply described the issue as well as I could to Claude and asked politely for help figuring out the cause. It found the exact issue correctly on the first try and came up with a few different suggestions for how to fix it. The solutions weren't quite what I needed but it still saved me a bunch of time just figuring out the error.

M4v3R

That’s my experience as well. Many bugs involve typos, syntax issues or other small errors that LLMs are very good at catching.

yousif_123123

The opposite problem is also true. I was using it to edit code I had that was calling the new openai image API, which is slightly different from the dalle API. But Gemini was consistently "fixing" the OpenAI call even when I explained clearly not to do that since I'm using a new API design etc. Claude wasn't having that issue.

The models are very impressive. But issues like these still make me feel they are still more pattern matching (although there's also some magic, don't get me wrong) but not fully reasoning over everything correctly like you'd expect of a typical human reasoner.

disgruntledphd2

They are definitely pattern matching. Like, that's how we train them, and no matter how many layers of post training you add, you won't get too far from next token prediction.

And that's fine and useful.

mdp2021

> fine and useful

And crippled, incomplete, and deceiving, dangerous.

toomuchtodo

It seems like the fix is straightforward (check the output against a machine readable spec before providing it to the user), but perhaps I am a rube. This is no different than me clicking through a search result to the underlying page to verify the veracity of the search result surfaced.
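As a sketch of what such a check could look like, here using a JSON Schema as the machine-readable spec (the schema and the model output below are invented for illustration):

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical spec describing what the model is allowed to return.
schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["s3:GetObject", "s3:PutObject"]},
        "resource": {"type": "string"},
    },
    "required": ["action", "resource"],
    "additionalProperties": False,
}

model_output = '{"action": "s3:ReadBucketContents", "resource": "arn:aws:s3:::my-bucket/*"}'

try:
    validate(instance=json.loads(model_output), schema=schema)
    print("output conforms to spec")
except ValidationError as e:
    # The invented action is caught here, before it ever reaches the user.
    print("rejecting model output:", e.message)
```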

disgruntledphd2

Why coding agents et al don't make use of the AST through LSP is a question I've been asking myself since the first release of GitHub copilot.

I assume that it's trickier than it seems as it hasn't happened yet.

redox99

Making LLMs know what they don't know is a hard problem. Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

Volundr

> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

Are we sure they know these things as opposed to being able to consistently guess correctly? With LLMs I'm not sure we even have a clear definition of what it means for it to "know" something.

ajross

> Are we sure they know these things as opposed to being able to consistently guess correctly?

What is the practical difference you're imagining between "consistently correct guess" and "knowledge"?

LLMs aren't databases. We have databases. LLMs are probabilistic inference engines. All they do is guess, essentially. The discussion here is about how to get the guess to "check itself" with a firmer idea of "truth". And it turns out that's hard because it requires that the guessing engine know that something needs to be checked in the first place.

redox99

Yes. You could ask for factual information like "Tallest building in X place" and first it would answer it did not know. After pressuring it, it would answer with the correct building and height.

But also things where guessing was desirable. For example with a riddle it would tell you it did not know or there wasn't enough information. After pressuring it to answer anyway it would correctly solve the riddle.

The official llama 2 finetune was pretty bad with this stuff.

rdtsc

> Making LLMs know what they don't know is a hard problem. Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

They are the perfect "fake it till you make it" example cranked up to 11. They'll bullshit you, but will do it confidently and with proper grammar.

> Many attempts at making them refuse to answer what they don't know caused them to refuse to answer things they did in fact know.

I can see in some contexts that being desirable if it can be a parameter that can be tweaked. I guess it's not that easy, or we'd already have it.

bezier-curve

The best way around this is to dump documentation of the APIs you need them privy to into their context window.

null

[deleted]

paulirish

> Gemini 2.5 Pro now ranks #1 on the WebDev Arena leaderboard

It'd make sense to rename WebDev Arena to React/Tailwind Arena. Its system prompt requires [1] those technologies and the entire tool breaks when requesting vanilla JS or other frameworks. The second-order implications of models competing on this narrow definition of webdev are rather troublesome.

[1] https://blog.lmarena.ai/blog/2025/webdev-arena/#:~:text=PROM...

aero142

If LLMs are able to write better code with more declarative and local programming components and Tailwind, then I could imagine a future where a new programming language is created to maximize LLM success.

epolanski

This so much.

To me it seems so strange that a few good language designers and ML folks haven't grouped together to work on this.

It's clear that there is a space for some LLM meta-language that could be designed to compile to bytecode, binary, JS, etc.

It also doesn't need to be textual like the code we write; it could be some form of AST that an LLM can manipulate with ease.

senbrow

At that point why not just have LLMs generate bytecode in one shot?

Plenty of training data to go on, I'd imagine.

seb1204

Would this be addressed by better documentation of code and APIs as well as examples? All this would go into the training materials and then be the body of knowledge.

LZ_Khan

readability would probably be the sticking point

nicce

> I could imagine a future where a new programming language is created to maximize llm success.

Who will write the useful training data without LLMs? I feel we are getting less and less new things. Changes will be smaller and incremental.

shortcord

Not a fan of the dominance of shadcn and Tailwind when it comes to generating greenfield code.

BoorishBears

shadcn/ui is such a terrible thing for the frontend ecosystem, and it'll get even worse for it as AI gets better.

Instead of learnable, stable, APIs for common components with well established versioning and well defined tokens, we've got people literally copying and pasting components and applying diffs so they can claim they "own them".

Except the vast majority of them don't ever change a line and just end up with a strictly worse version of a normal package (typically out of date or a hodgepodge of "versions" because they don't want to figure out diffs), and the few that do make changes don't have anywhere near the design sense to be using shadcn since there aren't enough tokens to keep the look and feel consistent across components.

The would-be 1% who would change it and have their own well-thought-out design systems don't get a lift from shadcn either vs just starting with Radix directly.

-

Amazing spin job though with the "registry" idea too: "it's actually very good for AI that we invented a parallel distribution system for ad-hoc components with no standard except a loose convention around sticking stuff in a folder called ui"

nicce

> It'd make sense to rename WebDev Arena to React/Tailwind Arena.

Funnily, the training of these models seems to have been cut off in the middle of the Tailwind v3/v4 transition, and Gemini always tries to correct my mistakes (… use v3 instead of v4)

baq

Same for some Material UI things in react. This is easily fixed by pasting relevant docs directly into the context, but annoying that you have to do that at all.

postalrat

I've found them to be pretty good with vanilla html and css.

codebolt

This model also seems to do a decent job with Angular. When I was using ChatGPT it was mostly stuck in pre-16 land, and struggled with signals etc, but this model seems to correctly suggest use of the latest features by default.

martinsnow

Bwoah, it's almost as if React and Tailwind are the bee's knees in frontend atm

byearthithatius

Sadly. Tailwind is so oof in my opinion. Let's import megabytes just so we don't have to write 5 whole CSS classes. I mean, just copy-paste the code.

Don't get me started on how ugly the HTML becomes when most tags have 20 f*cking classes which could have been two.

johnfn

In most reasonably-sized websites, Tailwind will decrease overall bundle size when compared to other ways of writing CSS. Which is less code, 100 instances of "margin-left: 8px" or 100 instances of "ml-2" (and a single definition for ml-2)? Tailwind will dead-code eliminate all rules you're not using.

In typical production environments tailwind is only around 10kb[1].

[1]: https://v3.tailwindcss.com/docs/optimizing-for-production

martinsnow

You're doing it wrong. Tailwind is endlessly customizable and after compilation is only kilobytes. But yes, let's complain because we don't understand the tooling....

xd1936

[flagged]

ranyume

I don't know if I'm doing something wrong, but every time I ask Gemini 2.5 for code it outputs SO MANY comments. An exaggerated amount of comments. Section comments, step comments, block comments, inline comments, all the gang.

lukeschlather

I usually remove the comments by hand. It's actually pretty helpful, it ensures I've reviewed every piece of code carefully, especially since most of the comments are literally just restating the next line, and "does this comment add any information?" is a really helpful question to make sure I understand the code.

tasuki

Same! It eases my code review. In the rare occasions I don't want to do that, I ask the LLM to provide the code without comments.

Benjammer

I've found that heavily commented code can be better for the LLM to read later, so it pulls in explanatory comments into context at the same time as reading code, similar to pulling in @docs, so maybe it's doing that on purpose?

koakuma-chan

No, it's just bad. I've been writing a lot of Python code over the past two days with Gemini 2.5 Pro Preview, and all of its code was like:

```python
def whatever():
    # --- SECTION ONE OF THE CODE ---
    ...
    # --- SECTION TWO OF THE CODE ---
    try:
        [some "dangerous" code]
    except Exception as e:
        logging.error(f"Failed to save files to {output_path}: {e}")
        # Decide whether to raise the error or just warn
        # raise IOError(f"Failed to save files to {output_path}: {e}")
```

(it adds commented out code like that all the time, "just in case")

It's terrible.

I'm back to Claude Code.

NeutralForest

I'm seeing it try to catch blind exceptions in Python all the time. I see it in my colleagues' code all the time too; it's driving me nuts.

brandall10

It's certainly annoying, but you can try following up with "can you please remove superfluous comments? In particular, if a comment doesn't add anything to the understanding of the code, it doesn't deserve to be there".

null

[deleted]

breppp

I always thought these were there to ground the LLM on the task and produce better code, an artifact of the fact that it will autocomplete better based on past tokens. Similarly, I always thought this is why ChatGPT starts every reply by repeating exactly what you asked.

rst

Comments describing the organization and intent, perhaps. Comments just saying what a "require ..." line requires, not so much. (I find it will frequently put notes on the change it is making in comments, contrasting it with the previous state of the code; these aren't helpful at all to anyone doing further work on the result, and I wound up trimming a lot of them off by hand.)

puika

I have the same issue plus unnecessary refactorings (that break functionality). it doesn't matter if I write a whole paragraph in the chat or the prompt explaining I don't want it to change anything else apart from what is required to fulfill my very specific request. It will just go rogue and massacre the entirety of the file.

mgw

This has also been my biggest gripe with Gemini 2.5 Pro. While it is fantastic at one-shotting major new features, when wanting to make smaller iterative changes, it always does big refactors at the same time. I haven't found a way to change that behavior through changes in my prompts.

Claude 3.7 Sonnet is much more restrained and does smaller changes.

cryptoz

This exact problem is something I’m hoping to fix with a tool that parses the source to AST and then has the LLM write code to modify the AST (which you then run to get your changes) rather than output code directly.

I’ve started in a narrow niche of python/flask webapps and constrained to that stack for now, but if you’re interested I’ve just opened it for signups: https://codeplusequalsai.com

Would love feedback! Especially if you see promising results in not getting huge refactors out of small change requests!

(Edit: I also blogged about how the AST idea works in case you're just that curious: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...)
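For anyone curious what the AST round trip looks like in plain Python (this uses only the standard-library `ast` module and is not the codeplusequalsai implementation): parse, apply a scoped transform, unparse, and nothing outside the targeted node changes.

```python
import ast

class RenameFunction(ast.NodeTransformer):
    """Rename one function definition; every other node passes through untouched."""
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        if node.name == self.old:
            node.name = self.new
        return self.generic_visit(node)

source = "def fetch():\n    return 42\n"
tree = RenameFunction("fetch", "fetch_answer").visit(ast.parse(source))
print(ast.unparse(tree))  # prints the file back with only the function name changed
# Caveat: ast.unparse (Python 3.9+) drops comments and original formatting,
# which is one reason tools in this space often prefer concrete syntax trees.
```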

polyaniline

Asking it explicitly once (not necessarily every new prompt in context) to keep output minimal and strive to do nothing more than it is told works for me.

nolist_policy

Can't you just commit the relevant parts? The git index is made for this sort of thing.

fwip

Really? I haven't tried Gemini 2.5 yet, but my main complaint with Claude 3.7 is this exact behavior - creating 200+ line diffs when I asked it to fix one function.

bugglebeetle

This is generally controllable with prompting. I usually include something like, “be excessively cautious and conservative in refactoring, only implementing the desired changes” to avoid.

fkyoureadthedoc

Where/how do you use it? I've only tried this model through GitHub Copilot in VS Code and I haven't experienced much changing of random things.

diggan

I've used it via Google's own AI studio and via my own library/program using the API and finally via Aider. All of them lead to the same outcome, large chunks of changes to a lot of unrelated things ("helpful" refactors that I didn't ask for) and tons of unnecessary comments everywhere (like those comments you ask junior devs to stop making). No amount of prompting seems to address either problems.

dherikb

I have the exactly same issue using it with Aider.

Maxatar

Tell it not to write so many comments then. You have a great deal of flexibility in dictating the coding style and can even include that style in your system prompt or upload a coding style document and have Gemini use it.

Trasmatta

Every time I ask an LLM to not write comments, it still litters it with comments. Is Gemini better about that?

grw_

No, you can tell it not to write these comments in every prompt and it'll still do it

sitkack

LLMs are extremely poor at following negative instructions, tell them what to do, not what not to do.

nearbuy

Sample size of one, but I just tried it and it worked for me on 2.5 pro. I just ended my prompt with "Do not include any comments whatsoever."

dheera

I usually ask ChatGPT to "comment the shit out of this" for everything it writes. I find it vastly helps future LLM conversations pick up all of the context and why various pieces of code are there.

If it is ingesting data, there should also be a sample of the data in a comment.

HenriNext

Same experience. Especially the "step" comments about the performed changes are super annoying. Here is my prompt-rule to prevent them:

"5. You must never output any comments about the progress or type of changes of your refactoring or generation. Example: you must NOT add comments like: 'Added dependency' or 'Changed to new style' or worst of all 'Keeping existing implementation'."

Workaccount2

I have a strong sense that the comments are for the model more than the user. It's effectively more thinking in context.

stavros

It definitely dumped its CoT into a huge comment just now when I asked it to add some function calls.

Scene_Cast2

It also does super defensive coding. Not that it's a bad thing in general, but I write a lot of prototype code.

prpl

Production quality code is defensive. Probably trained on a lot of google code.

Tainnor

Depends on what you mean by "defensive". Anticipating error and non-happy-path cases and handling them is definitely good. Also fault tolerance, i.e. allowing parts of the application to fail without bringing down everything.

But I've heard "defensive code" used for the kind of code where almost every method validates its input parameters, wraps everything in a try-catch, returns nonsensical default values in failure scenarios, etc. This is a complete waste because the caller won't know what to do with the failed validations or thrown errors, and it's just unnecessary bloat that obfuscates the business logic. Validation, error handling and so on should be done in specific parts of the codebase (bonus points if you can encode the successful validation or the presence/absence of errors in the type system).
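A tiny sketch of the distinction in Python (the names are invented for illustration): validate once at the boundary and hand the rest of the code a type it can trust, instead of wrapping every helper in try/except with fallback defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    value: int

def parse_port(raw: str) -> Port:
    """Boundary validation: fail here, loudly, with context the caller can act on."""
    n = int(raw)  # raises ValueError on garbage input
    if not 1 <= n <= 65535:
        raise ValueError(f"port out of range: {n}")
    return Port(n)

def connect(port: Port) -> str:
    # Internal code trusts the type: no re-validation, no blanket try/except,
    # no nonsensical default values on failure.
    return f"connecting on port {port.value}"

print(connect(parse_port("8080")))
```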

montebicyclelo

Does the code consist of many large try/except blocks that catch "Exception"? Gemini seems to like doing that (I thought it was bad practice to catch the generic Exception in Python).

chr15m

Many of the comments don't even describe the code itself, but the change that was made to it. So instead of:

x = 1 // set X to 1

You get:

x = 1 // added this to set x to 1

And sometimes:

// x = 1 // removed this

These comments age really fast. They should be in a git commit not a comment.

As somebody who prefers code to self-describe what it is doing I find this behaviour a bit frustrating and I can't seem to context-prompt it away.

n_ary

Maybe these comments actually originate from annotated training data? If I were adding code annotations for training data, I would sort of expect comments like this, which don't add much value for me but give the model more contextual understanding…

laborcontract

My guess is that they've done a lot of tuning to improve diff based code editing. Gemini 2.5 is fantastic at agentic work, but it still is pretty rough around the edges in terms of generating perfectly matching diffs to edit code. It's probably one of the very few issues with the model. Luckily, aider tracks this.

They measure the old gemini 2.5 generating proper diffs 92% of the time. I bet this goes up to ~95-98% https://aider.chat/docs/leaderboards/

Question for the google peeps who monitor these threads: Is gemini-2.5-pro-exp (free tier) updated as well, or will it go away?

Also, in the blog post, it says:

  > The previous iteration (03-25) now points to the most recent version (05-06), so no action is required to use the improved model, and it continues to be available at the same price.

Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does the same apply to gemini-2.5-pro-exp-03-25?

update: I just tried updating the date in the exp model (gemini-2.5-pro-exp-05-06) and that doesn't work.

laborcontract

Update 2: I've been using this model in both aider and cline and I haven't gotten a diff matching error yet, even with some pretty difficult substitutions across different places in multiple files. The overall feel of this model is nice.

I don't have a formal benchmark but there's a notable improvement in code generation due to this alone.

I've had gemini chug away on plans that have taken ~1 hour to implement. (~80mln tokens spent) A good portion of that energy was spent fixing mistakes made by cline/aider/roo due to search/replace mistakes. If this model gets anywhere close to 100% on diffs then this is a BFD. I estimate this will translate to a 50-75% productivity boost on long context coding tasks. I hope the initial results i'm seeing hold up!

I'm surprised by the reaction in the rest of the thread. A lot of unproductive complaining, a lot of off topic stuff, nothing talking about the model itself.

Any thoughts from anyone else using the updated model?

esperent

Have you been using the Gemini 2.5 pro "Experimental" or "3-25" models in Cline? I've been using both over the last week and got quite a few diff errors, maybe 1 in 10 edits, so that 92% tracks for me.

Does this 2.5 pro "Preview" feel like an improvement if you had used the others?

laborcontract

Yep, I've been using the old and new models in Cline. I can't tell any difference outside of the improvement with diffs, but that's good enough for me.

vessenes

Question: are you calling it with “aider --model gemini”? And if so, do you see 05-06 listed or the old one?

null

[deleted]

null

[deleted]

okdood64

What do you mean by agentic work in this context?

laborcontract

Knowing when to call functions, generating the proper function calling text structure, properly executing functions in sequence, knowing when it's completed its objective, and doing that over an extended context window.

mohsen1

I use Gemini for almost everything. But their model card[1] only compares to o3-mini! In known benchmarks o3 is still ahead:

        +------------------------------+---------+--------------+
        |         Benchmark            |   o3    | Gemini 2.5   |
        |                              |         |    Pro       |
        +------------------------------+---------+--------------+
        | ARC-AGI (High Compute)       |  87.5%  |     —        |
        | GPQA Diamond (Science)       |  87.7%  |   84.0%      |
        | AIME 2024 (Math)             |  96.7%  |   92.0%      |
        | SWE-bench Verified (Coding)  |  71.7%  |   63.8%      |
        | Codeforces Elo Rating        |  2727   |     —        |
        | MMMU (Visual Reasoning)      |  82.9%  |   81.7%      |
        | MathVista (Visual Math)      |  86.8%  |     —        |
        | Humanity’s Last Exam         |  26.6%  |   18.8%      |
        +------------------------------+---------+--------------+
[1] https://storage.googleapis.com/model-cards/documents/gemini-...

jsnell

The text in the model card says the results are from March (including the Gemini 2.5 Pro results), and o3 wasn't released yet.

Is this maybe not the updated card, even though the blog post claims there is one? Sure, the timestamp is in late April, but I seem to remember that the first model card for 2.5 Pro was only released in the last couple of weeks.

cbg0

o3 is $40/M output tokens and 2.5 Pro is $10-15/M output tokens so o3 being slightly ahead is not really worth 4 times more than gemini.

jorl17

Also, o3 is insanely slow compared to Gemini 2.5 Pro

i_have_an_idea

Not sure why this is being downvoted, but it's absolutely true.

If you're using these models to generate code daily, the costs add up.

Sure, I'll give a really tough problem to o3 (and probably over ChatGPT, not the API), but on general code tasks, there really isn't meaningful enough difference to justify 4x the cost.

null

[deleted]

andy12_

Interestingly, when comparing benchmarks of Experimental 03-25 [1] and Experimental 05-06 [2], it seems the new version scores slightly lower on everything except LiveCodeBench.

[1] https://storage.googleapis.com/model-cards/documents/gemini-... [2] https://deepmind.google/technologies/gemini/

arnaudsm

This should be the top comment. Cherry-picking is hurting this industry.

I bet they kept training on coding tasks, made everything worse on the way, and tried to hide it under the rug because of the sunk costs.

cma

They likely knew continued training on code would cause some amount of catastrophic forgetting on other stuff. They didn't throw away the old weights, so it's probably not a sunk cost fallacy. But since the model is relatively new, and they found out that X% of API token spend was on coding agents (where X is huge) compared to the token spend distribution on prior Geminis that couldn't code well, they probably didn't want the complexity and worse batching of having a separate model for it if the impacts weren't too large. They likely decided they hadn't weighted coding enough initially and that the tradeoffs were worth it.

luckydata

Or because they realized that coding is what most of those LLMs are used for anyways?

arnaudsm

They should have shown the benchmarks. Or market it as a coding model, like Qwen & Mistral.

merksittich

According to the article, "[t]he previous iteration (03-25) now points to the most recent version (05-06)." I assume this applies to both the free tier gemini-2.5-pro-exp-03-25 in the API (which will be used for training) and the paid tier gemini-2.5-pro-preview-03-25.

Fair enough, one could say, as these were all labeled as preview or experimental. Still, considering that the new model is slightly worse across the board in benchmarks (except for LiveCodeBench), it would have been nice to have the option to stick with the older version. Not everyone is using these models for coding.

zurfer

Just switching a pinned version (even alpha, beta, experimental, preview) to another model doesn't feel right.

I get it, chips are scarce and they want their capacity back, but it breaks trust with developers to just downgrade your model.

Call it gemini-latest and I understand that things will change. Call it *-03-25 and I want the same model that I got on 25th March.

nopinsight

Livebench.ai actually suggests the new version is better on most things.

https://livebench.ai/#/

jjani

Sounds like they were losing so much money on 2.5-Pro they came up with a forced update that made it cheaper to run. They can't come out with "we've made it worse across the board", nor do they want to be the first to actually raise prices, so instead they made a bit of a distill that's slightly better at coding so they can still spin it positively.

sauwan

I'd be surprised if this was a new base model. It sounds like they just did some post-training RL tuning to make this version specifically stronger for coding, at the expense of other priorities.

jjani

Every frontier model now is a distill of a larger unpublished model. This could be a slightly smaller distill, with potentially the extra tuning you're mentioning.

Workaccount2

Google doesn't pay the nvidia tax. Their TPUs are designed for Gemini and Gemini designed for their TPUs. Google is no doubt paying far less per token than every other AI house.

null

[deleted]

excerionsforte

Yes, it does worse by a far margin. It requires more instructions and is way too eager to code without proper instructions, unlike the 03-25 version. I want that version back.

planb

> We’ve seen developers doing amazing things with Gemini 2.5 Pro, so we decided to release an updated version a couple of weeks early to get into developers hands sooner. Today we’re excited to release Gemini 2.5 Pro Preview (I/O edition).

What's up with AI companies and their model naming? So is this an updated 2.5 Pro and they indicate it by appending "Preview" to the name? Or was it always called 2.5 Preview and this is an updated "Preview"? Why isn't it 2.6 Pro or 2.5.1 Pro?

rtaylorgarlock

Agreed. The current 'Pro 2.5 Preview' following another 'Preview' hurts my brain, and the beta badges we're all throwing on products aren't stopping people from using things in production. Rate limits only go so far.

OrangeMusic

Honestly they're iterating so fast that it looks like they simply gave up giving their models proper version names.

They should just use a date ¯\_(ツ)_/¯

killerstorm

Why can't they just use version numbers instead of this "new preview" stuff?

E.g. call it Gemini Pro 2.5.1.

lukeschlather

I take preview to mean the model may be retired on an accelerated timescale and replaced with a "real" model so it's dangerous to put into prod unless you are paying attention.

lolinder

They could still use version numbers for that. 2.5.1-preview becomes 2.5.1 when stable.

danenania

Scheduled tasks in ChatGPT are useful for keeping track of these kinds of things. You can have it check daily whether there's a change in status, price, etc. for a particular model (or set of models).

cdolan

I appreciate that you are trying to help

But I do not want to have to build a network of bots with non-deterministic outputs to simply stay on top of versions

mhh__

Are you saying you find model names like o4-mini-high-pro-experimental-version5 confusing and stupid?

herpdyderp

I agree it's very good but the UI is still usually an unusable, scroll-jacking disaster. I've found it's best to let a chat sit for a few minutes after it has finished printing the AI's output. Finding the `ms-code-block` element in dev tools and logging `$0.textContent` is reliable too.

uh_uh

Noticed this too. There's something funny about billion dollar models being handicapped by stuck buttons.

energy123

The Gemini app has a number of severe bugs that impacts everyone who uses it, and those bugs have persisted for over 6 months.

There's something seriously dysfunctional and incompetent about the team that built that web app. What a way to waste the best LLM in the world.

kubb

It's the company. Letting incompetent people who are vocal rise to the top is a part of Google's culture, and the internal performance review process discourages excellence - doing the thousand small improvements that makes a product truly great is invisible to it, so nobody does it.

Software that people truly love is impossible to build in there.

thebytefairy

Like what? I use it daily and haven't come across anything seriously dysfunctional or incompetent.

OsrsNeedsf2P

Loading the UI on mobile while on low bandwidth is also a non-starter. It simply doesn't work.

zoogeny

I've noticed this has gotten a bit better lately, they have obviously been making a lot of UI changes to studio. But yeah, the scroll-jacking as response chunks are streamed in is incredibly frustrating since the model is pretty wordy.

I should add as well, on long complex threads the UI can become completely unusable. I'll hover over the tab and see it using over 2Gb of memory in chrome. Every so often I have to open a completely new tab, cut-n-paste the url and continue the conversation in that new tab (where the memory tends to drop back down to 600MB).

hispanus

It's amazing to see how they haven't implemented such a trivial feature as a sticky "copy code" button for code blocks. Frustrating to say the least.

arnaudsm

Be careful, this model is worse than 03-25 in 10 of the 12 benchmarks (!)

I bet they kept training on coding, made everything worse on the way, and tried to hide it under the rug because of the sunk costs.

null

[deleted]

ramblerman

where do you see that?

arnaudsm

New model homepage : https://deepmind.google/technologies/gemini/

Old model card : https://storage.googleapis.com/model-cards/documents/gemini-...

They intentionally buried that information

jstummbillig

It seems that trying to build llms is the definition of accepting sunk cost.

simonw

Here's a summary of the 394 comments on this post created using the new gemini-2.5-pro-preview-05-06. It looks very good to me - well grouped, nicely formatted.

https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...

30,408 input, 8,535 output = 12.336 cents.
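The arithmetic checks out if you assume the preview's list pricing of $1.25 per million input tokens and $10 per million output tokens (the under-200k-context tier):

```python
# Rough cost check under the assumed pricing above.
input_cost = 30_408 * 1.25 / 1_000_000    # ≈ $0.0380
output_cost = 8_535 * 10.00 / 1_000_000   # ≈ $0.0854
print(f"{(input_cost + output_cost) * 100:.3f} cents")  # 12.336 cents
```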

8,500 is a very long output! Finally a model that obeys my instructions to "go long" when summarizing Hacker News threads. Here's the script I used: https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...

user_7832

> Finally a model that obeys my instructions to "go long"

Something I have found is that when you give a reason for it, LLMs are more likely to do it. The first time I was in a metro with limited internet I told Claude (3.5) as such and asked it to reply in more detail as I could not type back and forth frequently… and it delivered extremely well. Since then I’ve found it to be a helpful prompt for all LLMs I’ve used it on.

ionwake

Is it possible to use this with Cursor? If so what is the name of the model? gemini-2.5-pro-preview ?

edit> Its gemini-2.5-pro-preview-05-06

edit> Cursor says it doesn't have "good support" yet, but I'm not sure if this is a default message when it doesn't recognise a model? Is this a big deal? Should I wait until it's officially supported by Cursor?

Just trying to save time here for everyone - anyone know the answer?

tough

Cursor UI sucks, it tells me to use -auto mode- to be faster, but gemini 2.5 is way faster than any of the other free models, so just selecting that one is faster even if the UI says otherwise

chrisvalleybay

Actually, I find the Cursor UX to be vastly superior to any of the others. I've tried Visual Studio with Copilot, Roo, Cline, Trae and Windsurf. Nothing gets close to the experience of Cursor right now, IMO.

ionwake

yeah I've noticed this too, like wtf would I use Auto?

tough

another thing I hate is cmd+enter (Accept) and cmd+del (Cancel) being on the same button, with tiny UI text that changes interchangeably depending on whether it's a command, edit, or tool call

androng

At the bottom of the article it says no action is required and the Gemini-2.5-pro-preview-03-25 now points to the new model

ionwake

well, a lot of action was required, such as adding the model, so no idea what happened to the guy who wrote the article; maybe there is a new Cursor update now

bn-l

The one with exp in the name is free (you may have to add it yourself), but they train on you, and after a certain limit it becomes paid.