GPT-5 vs. Sonnet: Complex Agentic Coding
122 comments · August 8, 2025
h4ny
I have been seeing different people reporting different results on different tasks. I watched a live stream that compared GPT-5, Gemini 2.5 Pro, Claude 4 Sonnet, and GLM 4.5, and GPT-5 appeared not to follow instructions as well as the other three.
At the moment it feels like most people "reviewing" models depend on their beliefs and agendas, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed).
The blurring boundaries between technical overview, news, opinions and marketing is truly concerning.
epolanski
I will also state another semi-obvious thing that people seem to consistently forget: models are non-deterministic.
You are not going to get the same output from GPT5 or Sonnet every time.
And this obviously compounds across many different steps.
E.g. give GPT-5 the code for a feature (by pointing it at some files and tests) and tell it to review it, find improvement opportunities, and write them down: depending on the size of the code, etc., the answers will be slightly different each time.
I often do it in Cursor by having multiple agents review a PR, and each of them:
- has to write down their pr-number-review-model.md (e.g. pr-15-review-sonnet4.md)
- has to review the reviews in the other files
Then I review it myself and try to decide what's valuable in there and what isn't. To my disappointment (in myself):
- they often point to valid flaws I would not have thought of
- they miss the "end-to-end" or general view of how the code fits into a program/process/business. What I mean is: sometimes the real feedback would be that we don't need the feature at all. But you need to have those conversations with the AI earlier.
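Roughly, a sketch of the prompt and filename conventions described above (the model list and wording here are illustrative, not the exact prompts used):

```python
# Sketch of the two-phase review workflow. Only the prompt and filename
# conventions are shown; the agents themselves are launched from whatever
# tool you use (Cursor, a CLI, etc.). Names are illustrative.
PR_NUMBER = 15
MODELS = ["sonnet4", "gpt5", "gemini25"]

def review_prompt(model: str) -> str:
    # Phase 1: each agent reviews the PR and writes its own review file.
    return (
        f"Review the changes in PR #{PR_NUMBER}. Focus on correctness, edge cases, "
        f"and maintainability. Write your findings to pr-{PR_NUMBER}-review-{model}.md."
    )

def cross_review_prompt(model: str) -> str:
    # Phase 2: each agent reads the other agents' reviews and comments on them.
    others = [f"pr-{PR_NUMBER}-review-{m}.md" for m in MODELS if m != model]
    return (
        "Read the following review files and note where you agree, disagree, "
        f"or see something the others missed: {', '.join(others)}."
    )
```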
x187463
This has been ubiquitous for a while. Even here on HN every thread about these models (even this one, I'm sure) features an inordinate amount of disagreement between people vehemently declaring one model more useful than another. There truly seems to be no objective measurement of quality that can discern the difference between frontier models.
physix
I think this is actually good, because it means there is no clear winner who can sit back and demand rent. Instead they all work as hard as they can to stay competitive, hopefully thereby accelerating AI software engineering capabilities, with the investors footing the bill.
NitpickLawyer
Yeah, I agree. And prices are slowly coming down. Gemini 2.5 was cheaper than claude4, and (again depending on task) either on par or slightly below in quality. Now gpt5 is cheaper still (I think their -main is 10$/M?) and they also have -mini and -nano versions. The more choices we have the better it will be. As you said, without a clear winner we're about to get spoiled for choice, and there's no clear way for them to just sit on stuff and increase prices (yet). Plus there's some pressure coming from the open source releases. Not there in quality, but they are runnable "on prem", pretty cheap and keep getting better.
vineyardmike
> At the moment it feels like most people "reviewing" models depends on their believes and agenda, and there are no objective ways to evaluate and compare models
I think you’ll always have some disagreement generally in life, but especially for things like this. Code has a level of subjectivity. Good variable names, the correct amount of abstraction, verbosity, over-complexity, etc. are at least partially opinions. That makes benchmarking something subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed in the RNG.
Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 uses “think harder”, then what should be in the prompt?
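As a toy illustration of the harness problem (the keyword mapping below is an assumption taken from this comment, not vendor documentation):

```python
# Why the harness prompt is not model-neutral: the "go think hard" trigger
# differs per model family, so a comparison has to decide whether to adapt
# the prompt or keep it byte-identical across models.
THINKING_HINTS = {
    "claude": "ultrathink",     # keyword the comment says Claude responds to
    "gpt-5": "think harder",    # hypothetical equivalent for GPT-5
}

def build_prompt(task: str, model_family: str, adapt: bool = True) -> str:
    # Prepend the model-specific thinking hint only when adapting the prompt.
    hint = THINKING_HINTS.get(model_family, "") if adapt else ""
    return f"{hint}\n{task}".strip() if hint else task

print(build_prompt("Refactor the payments module without changing behaviour.", "claude"))
```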
Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart, and I aggressively manage the context. I’ve found GPT models can’t write Zig at all, while Gemini can, but both write Python excellently. All that said, if I didn’t like an answer, I’d prompt again, but if I liked the answer, I never tried again with a different model to see if I’d like its output even more. So it’s hard to know what could’ve been.
I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in cursor. I don’t get the hype of Qwen though because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, but I never took the time to experiment with prompting.
jjfoooo4
There's certainly a symbiosis between blog publishers and small startups wanting to be perceived as influential, and big companies releasing models and wanting favorable coverage.
I heavily discount same-day commentary; there's a quid pro quo between early access and favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).
I don't think it's all particularly concerning; you can discount reviews that are coming out so quickly that it's unlikely the reviewer has really used the model very much.
muzani
If you were to objectively rank things, durian would be the best fruit in the world, python would be the best programming language, and the Tesla Model Y is the best car. Everyone has multiple inconsistent opinions on everything because everything is not the same.
Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car or even a durian.
qsort
Thankfully that isn't a problem: we have scientific and reliable benchmarks to cut through the nonsense! Oh wait...
isaacremuant
> The blurring boundaries between technical overview, news, opinions and marketing is truly concerning.
Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.
chromejs10
This should have been compared with Opus... I know OP says he didn't because of cost, but if you're comparing who is better then you need to compare the best to the best... if Claude Opus 4.1 is significantly better than GPT-5, then that could offset the extra expense. Not saying it will... but forget cost if we want to compare quality alone.
nearbuy
For what it's worth, I've been trying Opus 4.1 in VS Code through GitHub Copilot and it's been really bad. Maybe worse than Sonnet and GPT 4.1. I'm not sure why it was doing so poorly.
In one instance, I asked it to optimize a roughly 80-line C# method that matches object positions by object ID and delta-encodes their positions against the previous frame. It seemed confused about how all this should work and output completely wrong code. It had all the context it needed in the file, and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
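For context, a rough Python sketch of the kind of routine being described (the data shapes and names here are hypothetical, not the actual C# method):

```python
# Match objects between two frames by ID and delta-encode positions against
# the previous frame. Objects without a previous position stay absolute.
Position = tuple[float, float, float]

def delta_encode(prev_frame: dict[int, Position],
                 curr_frame: dict[int, Position]) -> tuple[dict[int, Position], dict[int, Position]]:
    deltas: dict[int, Position] = {}
    absolutes: dict[int, Position] = {}
    for obj_id, (x, y, z) in curr_frame.items():
        prev = prev_frame.get(obj_id)
        if prev is None:
            absolutes[obj_id] = (x, y, z)              # new object: nothing to diff against
        else:
            px, py, pz = prev
            deltas[obj_id] = (x - px, y - py, z - pz)  # existing object: store per-axis delta
    return deltas, absolutes
```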
I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
If you have a quick prompt you'd like me to try, I can share the results.
cpursley
Use Claude Code, the rest aren't worth the bother.
addandsubtract
What does Claude Code do differently to Copilot Agent? Shouldn't they produce the same(ish) result if they're using the same model?
muzani
Opus seems to need more babysitting IME, which is great if you are going to actually pair program. Terrible if you like leaving it to do its own thing or try to do multiple things at once.
epolanski
That's insightful.
I spend a lot of time planning tasks, generating various documents per PR (requirements, questions, todos), having AI poke holes in my ideas (business/product/UX/code-wise), etc.
After 45 minutes of back and forth in general we end up with a detailed plan.
This also has many benefits:
- writing tests becomes very simple (unit, integration, E2Es)
- writing documentation becomes very simple
- writing meaningful PRs becomes very simple
It is quite boring though, not gonna lie. But that's a price I have accepted for quality.
Also, clearing up the ideas so much beforehand often leads me to come up with creative ideas later in the day, when I go for walks and mentally review what we've done and how.
bongodongobob
To me it seems that Opus is really good at writing code if you give it a spec. The other day I had GPT come up with a spec for a D&D text game that uses the GPT API. It one-shotted a 1k-line program.
However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
runako
Re: the comments that Opus is not cost-effective... The whole sales pitch behind these tools, and quite specifically the pitch OpenAI made yesterday, is that they will replace people, specifically programmers. Opus is cheaper than a US-based engineer. It's totally reasonable to use it as the benchmark if it's the best.
Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.
michaelt
My experience with coding agents is they need a lot of hand-holding.
They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯
epolanski
> My experience with coding agents is they need a lot of hand-holding.
So do engineers.
The difference is that IRL engineers know a lot about the context of the business, features, product, ux, stakeholders, expectations, etc, etc which means that the hand-holding is a long running process.
LLMs need all of these things to be clearly written down and specified in one shot.
runako
Really depends on who's paying the bill, and how much gets done between interventions, right?
Inverting the problem, one might ask how best to spend (say) $5,000 monthly on coding agents. I don't know the answer to that.
qeternity
> but forget cost if we want to compare solely the quality
I think this is the whole reason not to compare it to Opus...
bgirard
I agree. Opus is cost-prohibitive for most longer coding tasks. The increase in output quality doesn't justify the cost.
sergiotapia
You compare what most engineers can actually use. Most engineers are not going to pay that insane price for Opus. It's extremely high compared to all other models, so even if it is slightly better, it's a non-starter for engineering workloads.
markbao
Most engineers spending their own money maybe, but the cost of Opus is not that much compared to the output when the company is paying for it.
andsoitis
> that insane price for Opus
I believe Opus starts at $20 a month, similar to GPT5 if you want more than just cursory usage.
Or am I missing something?
michaelt
For $20/month you get Opus in-browser chat access, and Sonnet claude code access.
If you want to use Opus in claude code, you've got to get the $100/month plan - or pay API prices. And agentic coding uses a lot of tokens.
sergiotapia
Yes, you are missing something. API pricing for Claude Opus 4.1 ("most intelligent model for complex tasks"):
- Input: $15 / MTok
- Output: $75 / MTok
- Prompt caching write: $18.75 / MTok
- Prompt caching read: $1.50 / MTok
fouc
GPT-5 isn't supposed to be the best; it's supposed to be cost-effective.
senko
From OpenAI website:
> Our smartest, fastest, and most useful model yet
I'd say it's definitely supposed to be the best, it just doesn't deliver.
cheema33
>> Our smartest, fastest, and most useful model yet
> I'd say it's definitely supposed to be the best, it just doesn't deliver.
What part of "Our" is difficult to understand in that statement? Or are you claiming that OpenAI owns another model that is clearly better than GPT-5?
arcticfox
> Note that Claude 4 Sonnet isn’t the strongest model from Anthropic’s Claude series. Claude Opus is their most capable model for coding, but it seemed inappropriate to compare it with GPT-5 because it costs 10 times as much.
Well - I would have been interested in GPT-5 vs. Opus. Claude Code Max is affordable with Opus.
qeternity
> Claude Code Max is affordable with Opus
Because Anthropic is presumably massively subsidizing the usage.
kvirani
Isn't it all heavily subsidized by VC money at this time?
adventured
OpenAI, for its part, is tracking toward $12-15 billion in annual sales. If they slapped a basic referral-ad model onto what they're already doing, it's an easily profitable enterprise doing $30+ billion in sales next year. Frankly, they should have already built and deployed that; it would make their free versions instantly profitable, and they could boost usage limits and choke off the competition. It's the very straightforward path to financially ruining their various weaker competitors. Anthropic is Lyft in this scenario (and I say that as a big fan of Claude).
Filligree
Which doesn’t factor into my immediate decisions.
Nizoss
I have been using Claude Code with TDD through hooks, which significantly improved my workflow for production code.
Watching the ChatGPT 5 demo yesterday, I noticed most of the code seemed oriented towards one-off scripts rather than maintainable codebases, which limits its value for me.
Does anyone know if ChatGPT 5 or Copilot have similar extensibility to enforce practices like TDD?
For context on the approach: https://github.com/nizos/tdd-guard
I use pre/post operation commands to enforce TDD rules.
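For readers unfamiliar with the mechanism: the idea is that a hook runs a command around the agent's file operations and can veto them. A minimal sketch of the kind of guard script such a hook might call, assuming the hook runner treats a non-zero exit status as "reject the operation" (a toy rule, not tdd-guard's actual logic):

```python
#!/usr/bin/env python3
"""Toy pre-operation guard: a crude approximation of a TDD rule
(no production edits without a failing test)."""
import subprocess
import sys

def tests_failing() -> bool:
    # Non-zero pytest exit means at least one test is failing (the "red" phase).
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode != 0

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else ""
    # Always allow edits to test files; allow production edits only while red.
    if target.startswith("tests/") or tests_failing():
        sys.exit(0)  # allow the operation
    print("Blocked: write a failing test before editing production code.")
    sys.exit(1)      # signal the hook runner to reject the operation
```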
ethan_smith
GPT-5 supports custom function calling which you could use to build similar TDD hooks via the API, though nothing as streamlined as your Claude Code implementation exists out-of-the-box yet.
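A rough sketch of what that could look like with the chat-completions tool-calling API; the tool schema, the "gpt-5" model name, and the gating system prompt are assumptions for illustration, not an existing integration:

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()

# Expose a single "run_tests" tool the model can call before proposing code.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and report pass/fail plus output.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def run_tests() -> str:
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return json.dumps({"passed": result.returncode == 0, "output": result.stdout[-2000:]})

messages = [
    {"role": "system", "content": "Follow TDD: call run_tests and confirm a failing test exists before proposing any implementation code."},
    {"role": "user", "content": "Add an is_palindrome(s) helper to utils.py."},
]

response = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
reply = response.choices[0].message

# If the model asked for the tool, execute it and feed the result back.
if reply.tool_calls:
    call = reply.tool_calls[0]
    messages.append(reply)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": run_tests()})
    response = client.chat.completions.create(model="gpt-5", messages=messages, tools=tools)
    print(response.choices[0].message.content)
```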
MrGreenTea
I just recently stumbled upon your tdd-guard while looking for inspiration for Claude hooks. I've been so impressed with how much it has let me improve my workflow and quality. Then I was somewhat disappointed that almost no one seems to talk about this potential and how they're using hooks. Yours was the only interesting project I found in this regard, and I hope to give it a spin this weekend.
You don't happen to have a short video where you go into a bit more detail on how you use it though?
Nizoss
Sorry, I missed the second part of your comment!
I don't have a detailed video beyond the short demo on the repo, but I'll look into recording something more comprehensive or cover it in a blog post. Happy to ping you when it's ready!
In the meantime: I simply set it up and go about my work. The only thing I really do is just nudge the agent into making architectural simplifications and make sure that it follows the testing strategies that I like: dependency injection, test helpers, test data factories and such. Things that I would do regardless of the hook.
I like to give my tests the same attention and care that I give production code. They should be meaningful and resilient. The code base contains plenty of examples but I will look into putting something together.
Nizoss
Thank you for the kind words, it means a lot!
I spent my summer holiday on this because I truly believe in the potential of hooks in agentic coding. I'm equally surprised that this space hasn't been explored more.
I'm currently working on making the validation faster and more customizable, plus adding reporters to support more languages.
I think there is an Amazon-backed VS Code fork that is also exploring this space. I think they market it as spec-driven development.
Edit: I found it, it's called Kiro: https://kiro.dev/
patcon
> One continuous difference: while GPT-5 would do lots of thinking then do something right the first time, Claude frantically tried different things — writing code, executing commands, making pretty dumb mistakes [...], but then recovering. This meant it eventually got to correct implementation with many more steps.
Sounds like Claude muddles. I consider that the stronger tactic.
I sure hope GPT-5 is muddling on the backend, or else I suspect it will be very brittle.
Re: https://contraptions.venkateshrao.com/p/massed-muddler-intel...
> Lindblom’s paper identifies two patterns of agentic behavior, “root” (or rational-comprehensive) and “branch” (or successive limited comparisons), and argues that in complicated messy circumstances requiring coordinated action at scale, the way actually effective humans operate is the branch method, which looks like “muddling through” but gradually gets there, where the root ["godding through"] method fails entirely.
quijoteuniv
Today I used GPT-5 for some OpenTelemetry Collector configs that both Claude and OpenAI models had struggled with before, and it was surprisingly impressive. It got the replies right on the first try. Previously, both had been tripped up by outdated or missing docs (OTel changes so quickly).
For home projects, I wish I could have GPT-5 plugged into Claude Code's CLI interface. Iteration just works! Looking forward to less babysitting in the future!
ako
Claude code router: https://github.com/musistudio/claude-code-router
macawfish
Any success using this with GPT-5? I got it set up but haven't had a chance to run it through its paces yet. Seemed like it was more or less working when I tried it out, but GPT-5 was much less transparent about progress.
mattnewton
Cursor CLI is pretty close to Claude Code. It's missing a bunch of features like being able to manually compact or define subagents, but the basic workflow is there, and if you squint it's pretty close to having GPT-5 in Claude Code.
I haven’t tried codex cli recently yet, I think it just got an update. That would be another to investigate.
sudohalt
Isn't the issue with that the prohibitive cost? It can easily be 5 to 10 (maybe even more for long-running tasks). Currently they are probably subsidizing the compute costs to some extent.
kiitos
yeah gpt-5 does lots of thinking and then does something -- but it's rarely the right thing, at least in my experience over the past day
macawfish
Claude is just so well rounded and considerate. A lot of this probably comes down to prompt and context engineering, though surely there's something magical about Anthropic's principled training methodologies. They invented constitutional AI and I can only imagine that behind the scenes they're doing really cool stuff. Can't wait to see Claude 5!
mewpmewp2
So far, from my testing, I have still found Claude Code with Sonnet 4 better than Cursor + GPT-5. I started the exact same projects at the same time, and it seemed Claude Code was just more reliable. GPT-5 was much slower at setting up the project and didn't set it up as scalably (despite them highlighting that in the demo), and when I tried to instruct it to set things up DRY, modular, etc., it didn't quite go where I wanted it to, while Claude Code did.
It was a game involving OOP, three.js. I think both are probably great at good design and CRUD things.
natiman1000
I was initially excited about GPT-5 and quickly switched to it, but I still can't use it; for some reason it is clearly smart but not useful.
mewpmewp2
I'm getting the same thing right now that I got with Codex when I tried it: I give it a command, and it keeps reading files and thinking for 5+ minutes. This never happens with Claude Code.
visarga
I tried GPT-5 in Cursor today and got the same thing - it started reading and thinking and thinking, and it was quite repetitive. It was unexpected. I wrote it off as "maybe the Cursor GPT-5 prompts are not refined yet".
dwaltrip
Oh, I tried Codex for the first time last night with GPT-5. It looked stuck twice when it used the grep tool (after working successfully for a minute or so), and both times I canceled after seeing no output for more than a minute.
Would it have eventually finished?
nextworddev
GPT-5 is much cheaper though
mewpmewp2
I'm using Claude Code which is $200/month, and I do multiple agents, subagents, terminals at the same time, much faster than Cursor. I get almost 24/7 dev time from that.
ramoz
Not using Claude code is a crime.
anotheryou
I think we need to stop testing models raw.
Claude is trained for Claude Code, and that's how it's used in the field too.
nightshift1
unless you use it through copilot
anotheryou
Claude? why would you
rezistik
My work pays for copilot subscriptions, they don't pay for claude code.
fouc
Claude Code can be used through VS code too. As for why? Some people prefer IDEs over terminals.
Personally I think the attempts to combine LLM coding with current IDE UIs, a la Cursor/Windsurf/VS Code is probably the wrong way to go, it feels too awkward and cumbersome. I like a more interactive interface, and Claude Code is more in line with that.
doctoboggan
I really like Claude Code's context engineering and prompt engineering. Is it possible to plug GPT-5 into Claude Code? I think that would be a more apples-to-apples test, as it's just testing the models and not the agentic framework around them.
koakuma-chan
I imagine Claude Code is optimized for Claude specifically, and GPT-5 would not be great there. You should probably use Codex if you want to use GPT-5.
lherron
Did I miss the total cost for each run in the article? Can't seem to find it.
If Sonnet is more expensive AND more chatty/requires more attempts for the same result, it seems like that would favor GPT-5 as a daily driver.
Is it really this easy now to get your article high on HN with 100 comments? The findings are completely meaningless.
"Agenticness" depends so much on the specific tooling (harness) and system prompts. It mentions Copilot - did it use this for both? Given it's created by Microsoft there's good reason to believe it'd be built yo do especially well with GPT (they'll have had 5 available in preview for months by now). Or it could be the opposite and be tuned towards Sonnet. At the very minimum you'd need to try a few different harnesses, preferably ones not closely related to either OpenAI/MS or Anthropic.
This article even mentions things like "Sonnet is much faster" which is very dependent on the specific load at the time of usage. Today everyone is testing GPT-5 so it's slow and Sonnet is much faster.