Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison

335 comments

·March 31, 2025

phkahler

Here is a real coding problem that I might be willing to make a cash-prize contest for. We'd need to nail down some rules. I'd be shocked if any LLM can do this:

https://github.com/solvespace/solvespace/issues/1414

Make a GTK 4 version of Solvespace. We have a single C++ file for each platform - Windows, Mac, and Linux-GTK3. There is also a QT version on an unmerged branch for reference. The GTK3 file is under 2KLOC. You do not need to create a new version, just rewrite the GTK3 Linux version to GTK4. You may either ask it to port what's there or create the new one from scratch.

If you want to do this for free to prove how great the AI is, please document the entire session. Heck make a YouTube video of it. The final test is weather I accept the PR or not - and I WANT this ticket done.

I'm not going to hold my breath.

snickell

This is the smoothest tom sawyer move I've ever seen IRL, I wonder how many people are now grinding out your GTK4 port with our favorite LLM/system to see if it can. It'll be interesting to see if anyone gets something working with current-gen LLMs.

UPDATE: naive (just fed it your description verbatim) cline + claude 3.7 was a total wipeout. It looked like it was making progress, then freaked out, deleted 3/4 of its port, and never recovered.

phkahler

>> This is the smoothest tom sawyer move I've ever seen IRL

That made me laugh. True, but not really the motivation. I honestly don't think LLMs can code significant real-world things yet and I'm not sure how else to prove that since they can code some interesting things. All the talk about putting programmers out of work has me calling BS but also thinking "show me". This task seems like a good combination of simple requirements, not much documentation, real world existing problem, non-trivial code size, limited scope.

cluckindan

I agree. I tried something similar: a conversion of a simple PHP library from one system to another. It was only like 500 loc but Gemini 2.5 completely failed around line 300, and even then its output contained straight up hallucinations, half-brained additions, wrong namespaces for dependencies, badly indented code and other PSR style violations. Worse, it also changed working code and broke it.

nico

> I honestly don't think LLMs can code significant real-world things yet and I'm not sure how else to prove that since they can code some interesting things

In my experience it seems like it depends on what they’ve been trained on

They can do some pretty amazing stuff in python, but fail even at the most basic things in arm64 assembly

These models have probably not seen a lot of GTK3/4 code and maybe not even a single example of porting between the two versions

I wonder if finetuning could help with that

snickell

Yes, very much agree, an interesting benchmark. Particularly because it’s in a “tier 2” framework (gtkmm) in terms of amount of code available to train an LLM on. That tests the LLMs ability to plan and problem solve compared with, say, “convert to the latest version of react” where the LLM has access to tens of thousands (more?) of similar ports in its training dataset and more has to pattern match.

codebra

Programmers who code interesting things likely shouldn’t worry. The legions who code voluminous but shallow corporate apps and glue might be more concerned.

ksec

>All the talk about putting programmers out of work

I keep thinking may be specifically Web programmers. Given a lot of the web essentially CRUD / have the same function.

SV_BubbleTime

Smooth? Nah.

Tom Sawyer? Yes.

jchw

I suspect it probably won't work, although it's not necessarily because an LLM architecture could never perform this type of work, but rather because it works best when the training set contains inordinate sample data. I'm actually quite shocked at what they can do in TypeScript and JavaScript, but they're definitely a bit less "sharp" when it comes to stuff outside of that zone in my experience.

The ridiculous amount of data required to get here hints that there is something wrong in my opinion.

I'm not sure if we're totally on the same page, but I understand where you're coming from here. Everyone keeps talking about how transformational these models are, but when push comes to shove, the cynicism isn't out of fear or panic, its disappointment over and over and over. Like, if we had an army of virtual programmers fixing serious problems for open source projects, I'd be more excited about the possibilities than worried about the fact that I just lost my job. Honest to God. But the thing is, if that really were happening, we'd see it. And it wouldn't have to be forced and exaggerated all the time, it would be plainly obvious, like the way AI art has absolutely flooded the Internet... except I don't give a damn if code is soulless as long as it's good, so it would possibly be more welcome. (The only issue is that it most likely actually suck when that happens, and rather just be functional enough to get away with, but I like to try to be optimistic once in a while.)

You really make me want to try this, though. Imagine if it worked!

Someone will probably beat me to it if it can be done, though.

skydhash

> the cynicism isn't out of fear or panic, its disappointment over and over and over

Very much this. When you criticize LLM's marketing, people will say you're a ludite.

I'd bet that no one actually likes to write code, as in typing into an editor. We know how to do it, and it's easy enough to enter in a flow state while doing it. But everyone is trying to write less code by themselves with the proliferation of reusable code, libraries, framework, code generators, metaprogramming,...

I'd be glad if I could have a DAW or CAD like interface with very short feedback (the closest is live programming with Smalltalk). So that I don't have to keep visualizing the whole project (it's mentally taxing).

e3bc54b2

> no one actually likes to write code

between this and..

> But everyone is trying to write less code by themselves with the proliferation of reusable code, libraries, framework, code generators, metaprogramming

.. this, is a massive gap. Personally speaking, I hate writing boilerplate code, y'know, old school Java with design patterns getter/setter, redundant multi-layer catch blocks, stateful for loops etc. That gets on my nerves, because it increases my work for little benefits. Cue modern coding practices and I'm almost exclusively thinking how to design solution to the problem at hand, and almost all the code is business logic exclusive.

This is where a lot of LLMs just fail. Handholding them all the way to correct solution feels like writing boilerplate again, except worse because I don't know when I'll be done. It doesn't help that most code available for LLMs is JS/TS/Java where boilerplate galore, but somehow I doubt giving them exclusively good codebases will help.

galbar

>I'd bet that no one actually likes to write code

And you'd be wrong. I, for one, enjoy the process of handcrafting the individual mechanisms of the systems I create.

SpaceNoodled

I like writing code. It's a fun and creative endeavor to figure out how to write as little as possible.

3abiton

Imo they are still extremely limited compared to a senior coder. Take python, still most top ranking models struggle with our codebase, every now and then I try to test few, and handling complex part of the codebase to produce coherent features still fails. They require heavy handholding from our senior devs, which I am sure they use AI as assistants.

ModernMech

> if that really were happening, we'd see it.

You're right, instead what we see is the emergence of "vibe coding", which I can best describe as a summoning ritual for technical debt and vulnerabilities.

8note

the typescript and javascript business though - the ais definitely trained on old old javascript.

i kinda think "javacript, the good parts" should be part of the prompt for generating TS and JS. I've seen too much of ai writing the sketchy bad parts

jay_kyburz

So yesterday I wanted to convert a color pallet I had in Lua that was 3 rgb ints, to Javascript 0x000000 notation. I sighed, rolled my eyes, but before I started this incredibly boring mindless task, asked Gamini if it would just do it for me. It worked, and I was happy, and I moved on.

Something is happening, its just not exciting as some people make it sound.

jchw

Be a bit more careful with that particular use case. It usually works, but depending on circumstances, LLMs have a relatively high tendency to start making the wrong correlations and give you results that are not actually accurate. (Colorspace conversions make it more obvious, but I think even simpler problems can get screwed up.)

Of course, for that use case, you can _probably_ do a bit of text processing in your text processing tools of choice to do it without LLMs. (Or have LLMs write the text processing pipeline to do it.)

gavinray

Convert the GTK 3 and GTK 4 API documentation into a single `.txt` file each.

Upload one of your platform-specific C++ file's source, along with the doc `.txt` into your LLM of choice.

Either ask it for a conversion function-by-function, or separate it some other way logically such that the output doesn't get truncated.

Would be surprised if this didn't work, to be honest.

pera

Do you really need to provide the docs? I would have imagined that those docs are included in their training sets. There is even a guide on how to migrate from GTK3 to GTK4, so this seems to be a low-hanging fruit job for an LLM iff they are okay for coding.

dagw

Feeding them the docs makes a huge difference in my experience. The docs might be somewhere in the training set, but telling the LLM explicitly "Use these docs before anything else" solves a lot of problems the the LLM mixing up different versions of a library or confusing two different libraries with a similar API.

Workaccount2

LLMs are not data archives. They are god awful at storing data, and even calling them a lossy compression tool is a stretch because it implies they are a compression tool for data.

LLM's will always benefit from in context learning because they don't have a huge archive of data to draw on (and even when they do, they are not the best at selecting data to incorporate).

iamjackg

You might not need to, but LLMs don't have perfect recall -- they're (variably) lossy by nature. Providing documentation is a pretty much universally accepted way to drastically improve their output.

baq

It moves the model from 'sorta-kinda-maybe-know-something-about-this' to being grounded in the context itself. Huge difference for anything underrepresented (not only obscure packages and not-Python not-JS languages).

jchw

In my experience even feeding it the docs probably won't get it there, but it usually helps. It actually seems to work better if the document you're feeding it is also in the training data, but I'm not an expert.

nurettin

Docs make them hallucinate a lot less. Unfortunately, all those docs will eat up the context window. Claude has "projects" for uploading them and Gemini2.5+ just has a very large window so maybe that's ok.

vasergen

The training set is huge and model "forgets" some of the stuff, providing docs in context makes sense, plus docs could be up to date comparing to training set.

stickfigure

My coding challenges are all variations on "start with this 1.5M line Spring project, full of multi-thousand-line files..."

phkahler

To do the challenge one would just need to understand the platform abstraction layer which is pretty small, and write 1K to 2K LOC. We don't even use much of the GUI toolkit functionality. I certainly don't need to understand the majority of a codebase to make meaningful contributions in specific areas.

qwertox

But you are aware that their limited context length just won't be able to deal with this?

That's like saying that you're judging a sedan by its capability of performing the job of a truck.

Wait, you were being sarcastic?

stickfigure

I am indeed saying that a sedan is incapable of handling my gigantic open-pit superfund site.

But I'll go a little farther - most meaningful, long-lived, financially lucrative software applications are metaphorically closer to the open-pit mine than the adorable backyard garden that AI tools can currently handle.

amelius

FWIW, what I want most in Solvespace is a way to do chamfers and fillets.

And a way to define parameters (not sure if that's already possible).

phkahler

>> FWIW, what I want most in Solvespace is a way to do chamfers and fillets.

I've outlined a function for that and started to write the code. At a high level it's straight forward, but the details are complex. It'll probably be a year before it's done.

>> And a way to define parameters (not sure if that's already possible).

This is an active work in progress. A demo was made years ago, but it's buggy and incomplete. We've been working out the details on how to make it work. I hope to get the units issue dealt with this week. Then the relation constraints can be re-integrated on top - that's the feature where you can type arbitrary equations on the sketch using named parameters (variables). I'd like that to be done this year if not this summer.

stn8188

While I second the same request, I'm also incredibly grateful for Solvespace as a tool. It's my favorite MCAD program, and I always reach for it before any others. Thank you for your work on it!

amelius

Sounds great, thanks for all the good work!

By the way, if this would make things simpler, perhaps you can implement chamfering as a post-processing step. This makes it maybe less general, but it would still be super useful.

esafak

A chance for all those coding assistant companies like Devin to show their mettle!

Aperocky

They'll happily demo writing hello world in 50 languages, or maybe a personal profile page with moving! icons! Fancy stuff.

They won't touch this.

refulgentis

> I'm not going to hold my breath.

The snark and pessimism nerd-sniped me :)

I've used AI heavily to maintain a cross-platform wrapper around llama.cpp. I figure its worth a shot.

I took a look and wanted to try but hit several hard blocks right away.

- There is no gtk-4 branch :o (presuming branch = git branch...Perhaps this is some project-specific terminology for a set of flags or something, and that's why I can't find it?)

- There's some indicators it is blocked by wxWidgets requiring GTK-4 support, which sounds much larger scope than advertised -- am I misunderstanding?

ramesh31

You guys really need a Docker build. This dependency chain with submodules is a nightmare.

phkahler

I'm a hater of complexity and build systems in general. Following the instructions for building solvespace on Linux worked for me out of the box with zero issues and is not difficult. Just copy some commands:

https://github.com/solvespace/solvespace?tab=readme-ov-file#...

ramesh31

>I'm a hater of complexity and build systems in general.

But you already have a complex cmake build system in place. Adding a standard Docker image with all the deps for devs to compile on would do nothing but make contributing easier, and would not affect your CI/CD/testing pipeline at all. I followed the readme and spent half an hour trying to get this to build for MacOS before giving up.

If building your project for all supported environments requires anything more than a single one-line command, you're doing it wrong.

semi-extrinsic

Alternative perspective: you kids with your Docker builds need to roll up your sleeves and learn how to actually compile a semi-complicated project if you expect to be able to contribute back to said project.

Philpax

If your project is hard to build, that's your problem, not mine. I'll simply spend my time working on projects that respect it.

disgruntledphd2

I can see both perspectives! But honestly, making a project easier to build is almost always a good use of time if you'd like new people to contribute.

ramesh31

>"Alternative perspective: you kids with your Docker builds need to roll up your sleeves and learn how to actually compile a semi-complicated project if you expect to be able to contribute back to said project."

Well, that attitude is probably why the issue has been open for 2 years.

qwertox

Gemini is the only model which tells me when it's a good time to stop chatting because either it can't find a solution or because it dislikes my solution (when I actively want to neglect security).

And the context length is just amazing. When ChatGPT's context is full, it totally forgets what we were chatting about, as if it would start an entirely new chat.

Gemini lacks the tooling, there ChatGPT is far ahead, but at its core, Gemini feels like a better model.

FirmwareBurner

>Gemini is the only model which tells me when it's a good time to stop chatting because either it can't find a solution or because it dislikes my solution

Claude used to also do that. Only ChatGPT starts falling apart when I start to question it then gives in and starting to give me mistakes as answers just to please me.

criddell

I asked Claude this weekend what it could tell me about writing Paint.Net plugins and it responded that it didn't know much about that:

> I'd be happy to help you with information about writing plugins for Paint.NET. This is a topic I don't have extensive details on in my training, so I'd like to search for more current information. Would you like me to look up how to create plugins for Paint.NET?

qwertox

I mean responses like this one:

  I understand the desire for a simple or unconventional solution, however there are problems with those solutions.
  There is likely no further explanation that will be provided.
  It is best that you perform testing on your own.

  Good luck, and there will be no more assistance offered.
  You are likely on your own.

This was about a SOCKS proxy which was leaking when the OpenVPN provider was down while the container got started, so we were trying to find the proper way of setting/unsetting iptable rules.

My proposed solution was to just drop all incoming SOCKS traffic until the tunnel was up and running, but Gemini was hooked on the idea that this was a sluggish way of solving the issue, and wanted me to drop all outgoing traffic until the tun device existed (with the exception of DNS and VPN_PROVIDER_IP:443 for building the tunnel).

light_hue_1

You like that?

This junk is why I don't use Gemini. This isn't a feature. It's a fatal bug.

It decides how things should go, if its way is right, and if I disagree it tells me to go away. No thanks.

I know what's happening. I want it to do things on my terms. It can suggest things, provide alternatives, but this refusal is extremely unhelpful.

criddell

That sounds like you asked for plans to a perpetual motion machine.

airstrike

LOL that to me reads like an absolute garbage of a response. I'd unsubscribe immediately and jump ship to any of the competitors if I ever got that

dr_kiszonka

I like its assertiveness too, but sometimes I wish there was an "override" button to force it to do what I requested.

davedx

I'm still using ChatGPT heavily for a lot of my day-to-day, across multiple projects and random real life tasks. I'm interested in giving Claude and Gemini a good at some point; where is Gemini's tooling lacking, generally?

null

[deleted]

neal_

I was using gemini 2.5 pro yesterday and it does seem decent. I still think claude 3.5 is better at following instruction then the new 3.7 model which just goes ham messing stuff up. Really disappointed by Cursor and the Claude CLI tool, for me they create more problems then fix. I cant figure out how to use them on any of my projects with out them ruining the project and creating terrible tech debt. I really like the way gemini shows how much context window is left, i think every company should have this. To be honest i think there has been no major improvement beyond the original models which gained popularity first. Its just marginal improvements 10% better or something, and the free models like deepseek are actually better imo then anything openai has. I dont think the market can withstand the valuations of the big ai companies. They have no advantage, there models suck worse then free open source ones, and they charge money??? Where is the benefit to there product?? People originally said the models are the moat and methods are top secret, but turns out its pretty easy to reproduce these models, and its the application layer built on top of the models that is much more specific and has the real moat. People said the models would engulf these applications built ontop and just integrate natively.

cjonas

My only experience is via cursor but I'd agree in that context 3.7 is worse than 3.5. 3.7 goes crazy trying to fix any little linter errors and often gets confused and will just hammer away, making things worse until I stop generation. I think if I let it continue it would probably proposed rm -rf and start over at some point :).

Again, this could just have to do with the way cursor is prompting it.

heed

believe it or not, i had cursor in yolo mode just for fun recently and 3.7 rm -rf'd my home folder :(

neal_

thats crazy! I haven't heard of yolo mode?? dont they like restrict access to the project? but i guess the terminal is unrestricted? lol i wonder what it was trying to do

runekaagaard

I'm getting great and stable results with 3.7 on Claude desktop and mcp servers.

It feels like an upgrade from 3.5

travisgriggs

So glad to see this!! I thought it was just me!

The latest updates, I’m often like “would you just hold the f#^^ on trigger?!? Take a chill pill already”

theshrike79

I asked claude 3.7 to move a perfectly working module to another location.

What did it do?

A COMPLETE FUCKING REWRITE OF THE MODULE.

The result did work, because of unit tests etc. but still, it has a habit of going down the rabbit hole of fixing and changing 42 different things when you ask for one change.

martin-t

Whenever I read about LLMs or try to use them, I feel like I am asleep in a dream where two contradicting things can be true at the same time.

On one hand, you have people claiming "AI" can now do SWE tasks which take humans 30 minutes or 2 hours and the time doubles every X months so by Y year, SW development will be completely automated.

On the other hand, you have people saying exactly what you are saying. Usually that LLMs have issues even with small tasks and that repeated/prolonged use generates tech debt even if they succeed on the small tasks.

These 2 views clearly can't both be true at the same time. My experience is the second category so I'd like to chalk up the first as marketing hype but it's confusing how many people who have seemingly nothing to gain from the hype contribute to it.

bitcrusher

I'm not sure why this is confusing? We're seeing the phenomenon everywhere in culture lately. People WANT something to be true and try to speak it into existence. They also tend to be the people LEAST qualified to speak about the thing they are referencing. It's not marketing hype, it is propaganda.

Meanwhile, the 'experts' are saying something entirely different and being told they're wrong or worse, lying.

I'm sure you've seen it before, but this propaganda, in particular, is the holy grail of 'business people'. The ones who "have a great idea, just need you to do all the work" types. This has been going on since the late 70s, early 80s.

martin-t

Not necessarily confusing but very frustrating. This is probably the first time I encountered such a wide range of opinions and therefore such a wide range of uncertainty in a topic close to me.

When a bunch of people very loudly and confidently say your profession, and something you're very good at, will become irrelevant in the next few years, it makes you pay attention. And when you then can't see what they claim to be seeing, then it makes you question whether something is wrong with you or them.

aleph_minus_one

> Whenever I read about LLMs or try to use them, I feel like I am asleep in a dream where two contradicting things can be true at the same time.

This is called "paraconsistent logic":

* https://en.wikipedia.org/wiki/Paraconsistent_logic

* https://plato.stanford.edu/entries/logic-paraconsistent/

frankohn

> people claiming "AI" can now do SWE tasks which take humans 30 minutes or 2 hours

Yes people claim that but everyone with a grain of salt in his mind know this is not true. Yes, in some cases an LLM can write from scratch a python or web demo-like application and that looks impressive but it is still far from really replacing a SWE. Real world is messy and requires to be careful. It requires to plan, do some modifications, get some feedback, proceed or go back to the previous step, think about it again. Even when a change works you still need to go back to the previous step, double check, make improvements, remove stuff, fix errors, treat corner cases.

The LLM doesn't do this, it tries to do everything in one single step. Yes, even when it is in "thinking" mode, in thinks ahead and explore a few possibilities but it doesn't do several iterations as it would be needed in many cases. It does a first write like a brilliant programmers may do in one attempt but it doesn't review its work. The idea of feeding back the error to the LLM so that it will fix it works in simple cases but in most common cases, where things are more complex, that leads to catastrophes.

Also when dealing with legacy code it is much more difficult for an LLM because it has to cope with the existing code with all its idiosincracies. One need in this case a deep understanding of what the code is doing and some well-thought planning to modify it without breaking everything and the LLM is usually bad as that.

In short, LLM are a wonderful technology but they are not yet the silver bullet someone pretends it to be. Use it like an assistant to help you on specific tasks where the scope is small the the requirements well-defined, this is the domain where it does excel and is actually useful. You can also use it to give you a good starting point in a domain you are nor familiar or it can give you some good help when you are stuck on some problem. Attempt to give the LLM a stack to big or complex are doomed to failure and you will be frustrated and lose your time.

radicality

At first thought you are gonna talk about how various LLMs will gaslight you, and say something is true, then only change their mind once you provide a counter example and when challenged with it, will respond “I obviously meant it’s mostly true, in that specific case it’s false”.

mountainriver

My whole team feels like 3.7 is a letdown. It really struggles to follow instructions as others are mentioning.

Makes me think they really just hacked the benchmarks on this one.

ignoramous

Claude Sonnet 3.7 Thinking is also an unmitigated disaster for coding. I was mistaken that a "thinking" model would be better at logic. It turns out "thinking" is a marketing term, a euphemism for "hallucinating" ... though, not unsurprising when you actually take a look at the model cards for these "reasoning" / "thinking" LLMs; however, I've found these to work nicely for IR (information retrieval).

theshrike79

Overthinking without extra input is always bad.

It's super bad for humans too. You start to spiral down a dark path when your thoughts run away and make up theories and base more theories on those etc.

dimitri-vs

They definitely over-optimized it for agentic use - where the quality of the code doesn't matter as much as it's ability to run, even if just barely. When you view it from that perspective all that nested errors handling, excessive comments, 10 lines that can be done in 2, etc. start to make sense.

vlovich123

Have you tried wind surf? I’ve been really enjoying it and wondering if they do something on top to make it work better. The AI definitely still gets into weird rabbit holes and sometimes even injects security bugs (kept trying to add sandbox permissions for an iframe), but at least for UI work it’s been an accelerant.

thicTurtlLverXX

In the Rubic's cube example, to solve the cube gemini2.5 just uses the memorized scrambling sequence:

// --- Solve Function ---

function solveCube() { if (isAnimating || scrambleSequence.length === 0) return;

  // Reverse the scramble sequence
  const solveSequence = scrambleSequence
    .slice()
    .reverse()
    .map((move) => {
      if (move.endsWith("'")) return move.slice(0, 1); // U' -> U
      if (move.endsWith("2")) return move; // U2 -> U2
      return move + "'"; // U -> U'
    });

  let promiseChain = Promise.resolve();
  solveSequence.forEach((move) => {
    promiseChain = promiseChain.then(() => applyMove(move));
  });

  // Clear scramble sequence and disable solve button after solving
  promiseChain.then(() => {
    scrambleSequence = []; // Cube is now solved (theoretically)
    solveBtn.disabled = true;
    console.log("Solve complete.");
  });
}

afro88

Thank you. This is the insidious thing about black box LLM coding.

breadwinner

The loser in the AI model competition appears to be... Microsoft.

When ChatGPT was the only game in town Microsoft was seen as a leader, thanks to their wise investment in Open AI. They relied on Open AI's model and didn't develop their own. As a result Microsoft has no interesting AI products. Copilot is a flop. Bing failed to take advantage of AI, Perplexity ate their lunch.

Satya Nadella last year: “Google should have been the default winner in the world of big tech’s AI race”.

Sundar Pichai's response: “I would love to do a side-by-side comparison of Microsoft’s own models and our models any day, any time. They are using someone else's model.”

See: https://www.msn.com/en-in/money/news/sundar-pichai-vs-satya-...

dughnut

Copilot is the only authorized AI at my company (50K FTE). I would be cautious to make any assumptions about how well anyone is doing in the AI space without some real numbers. My cynical opinion on enterprise software sales is that procurement decisions have absolutely nothing to do with product cost, performance, or value.

ZeWaka

I don't think the Copilot product is a flop - they're doing quite well selling it along with GitHub and Visual Studio (Code).

The best part about it, coding-wise, is that you can choose between 7 different models.

airstrike

I think he's talking about Microsoft Copilot 365, not the coding assistant.

Makes one wonder how much they are offering to the owner of www.copilot.com and why on God's green earth they would abandon the very strong brand name "Office" and www.office.com

l5870uoo9y

Had to lookup office.com myself to see it; their office package is literally called MS Copilot.

breadwinner

I consider Copilot a flop because it can't do anything. For example open Copilot on Windows and ask it to increase volume. It can't do it, but it will give you instructions for how to do it. In other words it is no better than standalone AI chat websites.

maxloh

Note that Microsoft do have their own LLM team, and their own model called Phi-4.

https://huggingface.co/microsoft/phi-4

VladVladikoff

Recently I was looking for a small LLM that could perform reasonably well while answering questions with low latency, for near realtime conversations running on a single RTX 3090. I settled on Microsoft’s Phi-4 model so far. However I’m not sure yet if my choice is good and open to more suggestions!

mywittyname

I've been using claude running via Ollama (incept5/llama3.1-claude) and I've been happy with the results. The only annoyance I have is that it won't search the internet for information because that capability is disabled via flag.

jcmp

When my parent speak about AI, they call it Copilot. Mircosoft has a big Advantage that they can integrate AI in many daily used products, where it is not competing with their core product like Google

ErrorNoBrain

And google has it built into my phone's text message app

these days it seems like everyone is trying to get their AI to be the standard.

i wonder how things will look in 10 years.

gnatolf

Any way you can back up that Copilot is a flop?

breadwinner

Lots of articles on it... and I am not even talking about competitors like Benioff [1]. I am talking about user complaints like this [2]. Users expect Copilot to be fully integrated, like Cursor is into VSCode. Instead what you get is barely better than typing into standalone AI chats like Claude.AI.

[1] https://www.cio.com/article/3586887/marc-benioff-rails-again...

[2] https://techcommunity.microsoft.com/discussions/microsoft365...

paavohtl

The linked complaint is specifically about Microsoft Copilot, which despite the name is completely unrelated to the original GitHub Copilot. VS Code's integrated GitHub Copilot nowadays has the Copilot Edits feature, which can actually edit, refactor and generate files for you using a variety of models, pretty much exactly like Cursor.

marlott

[dead]

null

[deleted]

anotherpaulg

Gemini 2.5 Pro set a wide SOTA on the aider polyglot coding leaderboard [0]. It scored 73%, well ahead of the previous 65% SOTA from Sonnet 3.7.

I use LLMs to improve aider, which is >30k lines of python. So not a toy codebase, not greenfield.

I used Gemini 2.5 Pro for the majority of the work on the latest aider release [1]. This is the first release in a very long time which wasn't predominantly written using Sonnet.

The biggest challenge with Gemini right now is the very tight rate limits. Most of my Sonnet usage lately is just when I am waiting for Gemini’s rate limits to cool down.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

atonse

As someone who just adopted Cursor (and MCP) 2-3 weeks ago, Aider seems like a different world.

The examples of "create a new simple video game" cause me to glaze over.

Do you have a screencast of how you use aider to develop aider? I'd love to see how a savvy expert uses these tools for real-world solutions.

anotherpaulg

I actually get asked for screencasts a lot, so I made recently made some [0].

The recording of adding support for 100+ new coding languages with tree-sitter [1] shows some pretty advanced usage. It includes using aider to script downloading a collection of files, and using ad-hoc bash scripts to have aider modify a collection of files.

[0] https://aider.chat/docs/recordings/

[1] https://aider.chat/docs/recordings/tree-sitter-language-pack...

atonse

This is perfect. Thank you!

sedgjh23

This is excellent, thank you.

overgard

I remember back in the day when I did Visual Basic in the 90s there were a lot of cool "New Project from Template" things in Visual Studio, especially when you installed new frameworks and SDKs and stuff like that. With a click of a button you had something that kind of looked like a professional app! Or even now, the various create-whatever-app tooling in npm and node keeps on that legacy.

Anyway, AI "coding" makes me think of that but on steroids. It's fine, but the hype around it is silly, it's like declaring you can replace Microsoft Word because "New Project From Template" you got a little rich text widget in a window with a toolbar.

One of the things mentioned in the article is the writer was confused that Claude's airplane was sideways. But it makes perfect sense, Claude doesn't really care about or understand airplanes, and as soon as you try to refine these New Project From Template things the AI quickly stops being useful.

aiauthoritydev

Visual basic created a revolution in software world especially for poor countries like India. You will be surprised how many systems were automated and turned into software driven processes. It was just mindblowing.

If AI driven software can do it on steroid it would be a massive impact on economy.

overgard

I see what you're saying, but even with visual basic you needed to understand the basics of coding to get anywhere. Vibe coding seems to be more about intentionally not understanding things. I just think that, if you can code it yourself, maybe AI saves you some time, if you can't code it yourself, AI gives you a fancy toy that you can't expand or change. And God help you if you make an online service..

bratao

From my use case, the Gemini 2.5 is terrible. I have a complex Cython code in a single file (1500 lines) for a Sequence Labeling. Claude and o3 are very good in improving this code and following the commands. The Gemini always try to do unrelated changes. For example, I asked, separately, for small changes such as remove this unused function, or cache the arrays indexes. Every time it completely refactored the code and was obsessed with removing the gil. The output code is always broken, because removing the gil is not easy.

dagw

That matches my experience as well. Gemini 2.5 Pro seems better at writing code from scratch, but Claude 3.7 seems much better at refactoring my existing code.

Gemini also seems more likely to come up with 'advanced' ideas (for better or worse). I for example asked both for a fast C++ function to solve an on the surface fairly simple computational geometry problem. Claude solved it in a straight ahead and obvious way. Nothing obviously inefficient, will perform reasonably well for all inputs, but also left some performance on the table. I could also tell at a glance that it was almost certainly correct.

Gemini on the other hand did a bunch of (possibly) clever 'optimisations' and tricks, plus made extensive use of OpenMP. I know from experience that those optimisations will only be faster if the input has certain properties, but will be a massive overhead in other, quite common, cases.

With a bit more prompting and questions from my part I did manage to get both Gemini and Claude to converge on pretty much the same final answer.

rom16384

You can fix this using a system prompt to force it to reply just with a diff. It makes the generation much faster and much less prone to changing unrelated lines. Also try reducing the temperature to 0.4 for example, I find the default temperature of 1 too high. For sample system prompts see Aider Chat: https://github.com/Aider-AI/aider/blob/main/aider/coders/edi...

fl_rn_st

This reflects my experience 1:1... even telling 2.5 Pro to focus on the tasks given and ignore everything else leads to it changing unrelated code. It's a frustrating experience because I believe at its core it is more capable than Sonnet 3.5/3.7

pests

> The Gemini always try to do unrelated changes. For example, I asked, separately, for small changes such as remove this unused function

For anything like this, I don’t understand trying to invoke AI. Just open the file and delete the lines yourself. What is AI going to do here for you?

It’s like you are relying 100% on AI when it’s a tool in your toolset.

joshmlewis

Playing devils advocate here, it's because removing a function is not always as simple as deleting the lines. Sometimes there are references to that function that you forgot about that the LLM will notice and automatically update for you. Depending on your prompt it will also go find other references outside of the single file and remove those as well. Another possibility is that people are just becoming used to interacting with their codebase through the "chat" interface and directing the LLM to do things so that behavior carries over into all interactions, even perceived "simple" ones.

matsemann

Any IDE will do this for you a hundred times better than current LLMs.

Fr3ck

I like to code with an LLMs help making iterative changes. First do this, then once that code is a good place, then do this, etc. If I ask it to make one change, I want it to make one change only.

redog

For me I had to upload the library's current documentation to it because it was using outdated references and changing everything that was working in the code to broken and not focusing on the parts I was trying to build upon.

Jcampuzano2

If you don't mind me asking how do you go about this?

I hear people commonly mention doing this but I can't imagine people are manually adding every page of the docs for libraries or frameworks they're using since unfortunately most are not in one single tidy page easy to copy paste.

SweetSoftPillow

https://github.com/mufeedvh/code2prompt https://github.com/yamadashy/repomix

genewitch

Have the AI write a quick script using bs4 or whatever to take the HTML dump and output json, then all the aider-likes can use that json as documentation. Or just the HTML, but that wastes context window.

dr_kiszonka

If you have access to the documentation source, you can concatenate all files into one. Some software also has docs downloadable as PDF.

amarcheschi

using outdated references and docs is something i've experienced more or less with every model i've tried, from time to time

rockwotj

I am hoping MCP will fix this. I am building an MCP integration with kapa.ai for my company to help devs here. I guess this doesn’t work if you don’t add in the tool

simonw

That's expected, because they almost all have training cut-off dates from a year ago or longer.

The more interesting question is if feeding in carefully selected examples or documentation covering the new library versions helps them get it right. I find that to usually be the case.

therealmarv

set temperature to 0.4 or lower.

mrinterweb

Adjusting temperature is something I often forget. I think Gemini can range between 0.0 <-> 2.0 (1.0 default). Lowering the temp should get more consistent/deterministic results.

hyperbovine

Maybe the Unladen Swallow devs ended up on the Gemini team.

ekidd

How are you asking Gemini 2.5 to change existing code? With Claude 3.7, it's possible to use Claude Code, which gets "extremely fast but untrustworthy intern"-level results. Do you have a prefered setup to use Gemini 2.5 in a similar agentic mode, perhaps using a tool like Cursor or aider?

bratao

For all LLMs, I´m using a simple prompt with the complete code in triple quotes and the command at the end, asking to output the complete code of changed functions. Then I use Winmerge to compare the changes and apply. I feel more confident doing this than using Cursor.

pests

Should really check out aider. Automates this but also does things like make a repo map of all your functions / signatures for non-included files so it can get more context.

kingkongjaffa

Is there a less biased discussion?

The OP link is a thinly veiled and biased advert for something called composio and really a biased and overly flowery view of Gemini 2.5 pro.

Example:

“Everyone’s talking about this model on Twitter (X) and YouTube. It’s trending everywhere, like seriously. The first model from Google to receive such fanfare.

And it is #1 in the LMArena just like that. But what does this mean? It means that this model is killing all the other models in coding, math, Science, Image understanding, and other areas.”

tempoponet

I don't see it.

Composio is a tool to help integration of LLM tool calling / MCPs. It really helped me streamline setting up some MCPs with Claude desktop.

I don't see how pushing Gemini would help their business beyond encouraging people to play with the latest and greatest models. There's a 1 sentence call-to-action at the end which is pretty tame for a company blog.

The examples don't even require you to use Composio - they're just talking about prompts fed to different models, not even focused on tool calling, MCPs, or the Composio platform.

ZeroTalent

I believe their point was that they are writing about what people want to read (a new AI breakthrough), possibly embellishing or cherry-picking results, although we can't prove/disprove it easily.

This approach yields more upvotes and views on their website, which ultimately leads to increased conversions for their tool.

viscanti

If it's not astroturfing, the people who are so vocal about it act in a way that's nearly indistinguishable from it. I keep looking for concrete examples of use cases that show it's better, and everything seems to point back to "everyone is talking about it" or anecdotal examples that don't even provide any details about the problem that Gemini did well on and that other models all failed at.

lionkor

If I give you hundreds millions of dollars for just making a clone of something that exists (an LLM) and hype the shit out of it, how far would you go?

throwup238

I would change the world™ and make it a better place®.

Analemma_

Zvi Moshowitz's blog [0] is IME a pretty good place to keep track of the state of things, it's well-sourced and in-depth without being either too technical or too vibes-based. Generally every time a model is declared the new best you can count on him to have a detailed post examining the claim within a couple days.

[0]: https://thezvi.substack.com/

antirez

In complicated code I'm developing (Redis Vector Sets) I use both Claude 3.7 and Gemini 2.5 PRO to perform code reviews. Gemini 2.5 PRO can find things that are outside Claude abilities, even if Gemini, as a general purpose model, is worse. But It's inherently more powerful at reasoning on complicated code stuff, threading, logical errors, ...

larodi

Is this to say that you're writing the code manually and having the model verify for various errors, or also employing the model for actual code work.

Do you instruct the code to write in "your" coding style?

antirez

For Vector Sets, I decided to write all the code myself, and I use the models very extensively for the following three goals:

1. Design chats: they help a lot as a counterpart to detect if there are flaws in your reasoning. However all the novel ideas in Vector Sets were consistently found by myself and not by the models, they are not there yet.

2. Writing tests. For the Python test code, I let the model write it, under very strict prompts explaining very well what a given test should do.

3. Code reviews: this saved myself and future users a lot of time, I believe.

The way I used the model to write C code was to write throw away programs in order to test if certain approaches could work: benchmarks, verification programs for certain invariants, and so forth.

larodi

Insightful

I personally tried long runs with say writing a plugin for QGIS, but then I found it is better to actually personally write some parts of the code, so to remember it. Also advancing with smaller chunks seems to result in less iterations.

Besides, indeed, the whole concept seems to not work so well with ingenious stuff. The model simply fails to understand unless lots of explaining.

The LLM assisted tech writing though seems to benefit a lot from the cursor/cline approach. Here, more than anywhere else, a careful review is also needed.

sfjailbird

Every test task, including the coding test, is a greenfield project. Everything I would consider using LLMs for is not. Like, I would always need it to do some change or fix on a (large) existing project. Hell, even the examples that were generated would likely need subsequent alterations (ten times more effort goes into maintaining a line of code than writing it).

So these tests are meaningless to me, as a measure of how useful these models are. Great for comparison with each other, but would be interesting to include some tests with more realistic work.

maxnevermind

Indeed, I surprised to see that is has been in top-10 on HN for today. I thought everyone already realized that all of those examples like "create a flappy bird game" are not realistic and do not reflect the actual usefulness of the model, very few professionals in the industry endlessly create flappy bird games for a living.

anonzzzies

For Gemini: play around with the temperature: the default is terrible: we had much better results with (much) lower values.

CjHuber

From my experience a temperature close to 0 creates the best code (meaning functioning without modifications). When vibe coding I now use a very high temperature for brainstorming and writing specifications, and then have the code written at a very low one.

SubiculumCode

What improved, specifically?

anonzzzies

Much better code.