
Claude Opus 4.5

llamasushi

The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.

Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

losvedir

I almost scrolled past the "Safety" section, because in the past it always seemed sort of silly sci-fi scaremongering (IMO) or things that I would classify as "sharp tool dangerous in the wrong hands". But I'm glad I stopped, because it actually talked about real, practical issues like the prompt injections that you mention. I wonder if the industry term "safety" is pivoting to refer to other things now.

tekacs

This is also super relevant for everyone who had ditched Claude Code due to limits:

> For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to make sure you’re able to use Opus 4.5 for daily work.

Scene_Cast2

Still way pricier (>2x) than Gemini 3 and Grok 4. I've noticed that the latter two also perform better than Opus 4, so I've stopped using Opus.

wolttam

It's 1/3 the old price ($15/$75)

brookst

Not sure if that’s a joke about LLM math performance, but pedantry requires me to point out 15 / 75 = 1/5

l1n

$15/megatoken in, $75/megatoken out

conradkay

they mean it used to be $15/m input and $75/m output tokens

llamasushi

Just updated, thanks

827a

I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.

mritchie712

> only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think part of it is this[0] and I expect it will become more of a problem.

Claude models have built-in tools (e.g. `str_replace_editor`) which they've been trained to use. These tools don't exist in Cursor, but Claude really wants to use them.

0 - https://x.com/thisritchie/status/1944038132665454841?s=20
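
For reference, here's roughly what wiring up that built-in editor tool looks like against the raw Anthropic API. A minimal sketch only: the tool type/version string varies by model generation, so treat the exact values as illustrative.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # a generation that supports this tool version
        max_tokens=1024,
        tools=[{
            # Server-defined tool: no input schema needed, since the model
            # was trained on it. The version string depends on the model.
            "type": "text_editor_20250124",
            "name": "str_replace_editor",
        }],
        messages=[{"role": "user", "content": "Fix the typo in main.py"}],
    )

    # The model replies with tool_use blocks (view / str_replace / create
    # commands); the harness applies them and returns tool_result blocks.
    for block in response.content:
        if block.type == "tool_use":
            print(block.name, block.input)

In a harness that doesn't register this tool (like Cursor), the model can't emit those calls and falls back to whatever generic edit tools the harness exposes.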

HugoDias

TIL! I'll finally give Claude Code a try. I've been using Cursor since it launched and never tried anything else. The terminal UI didn't appeal to me, but knowing it has better performance, I'll check it out.

Cursor has been a terrible experience lately, regardless of the model. Sometimes for the same task, I need to try with Sonnet 4.5, ChatGPT 5.1 Codex, Gemini Pro 3... and most times, none managed to do the work, and I end up doing it myself.

At least I’m coding more again, lol

vunderba

My workflow was usually to use Gemini 2.5 Pro (now 3.0) for high-level architecture and design. Then I would take the finished "spec" and have Sonnet 4.5 perform the actual implementation.

nevir

Same here. Gemini really excels at all the "softer" parts of the development process (which, TBH, feels like most of the work). And Claude kicks ass at the actual code authoring.

It's a really nice workflow.

SkyPuncher

This is how I do it. Though, I've been using Composer as my main driver more and more.

* Composer - Line-by-line changes
* Sonnet 4.5 - Task planning and small-to-medium feature architecture. Pass it off to Composer for code.
* Gemini Pro - Large and XL architecture work. Pass it off to Sonnet to break down into tasks.

config_yml

I use plan mode in claude code, then use gpt-5 in codex to review the plan and identify gaps and feed it back to claude. Results are amazing.

jeswin

Same here. But with GPT 5.1 instead of Gemini.

vessenes

I like this plan, too. Gemini's recent series has long seemed to have the best large-context awareness among competing frontier models. Anecdotally, though, while much slower, I think GPT-5's architecture plans are slightly better.

UltraSane

I've done this and it seems to work well. I ask Gemini to generate a prompt for Claude Code to accomplish X

emodendroket

Yeah I think Sonnet is still the best in my experience but the limits are so stingy I find it hard to recommend for personal use.

lvl155

I really don’t understand the hype around Gemini. Opus/Sonnet/GPT are much better for agentic workflows. Seems people get hyped for the first few days. It also has a lot to do with Claude Code and Codex.

egeozcan

I'm completely the opposite. I find Gemini (even 2.5 Pro) much, much better than anything else. But I hate agentic flows, I upload the full context to it in aistudio and then it shines - anything agentic cannot even come close.

jdgoesmarching

Personally my hype is for the price, especially for Flash. Before Sonnet 4.5 was competitive with Gemini 2.5 Pro, the latter was a much better value than Opus 4.1.

chinathrow

I gave Sonnet 4.5 a base64-encoded PHP serialize() JSON of an object dump and told him to extract the URL within.

It gave me the YouTube URL to Rick Astley.

arghwhat

If you're asking an LLM to compute something "off the top of its head", you're using it wrong. Ask it to write the code to perform the computation and it'll do better.

Same with asking a person to solve something in their head vs. giving them an editor and a random python interpreter, or whatever it is normal people use to solve problems.

hu3

> I gave Sonnet 4.5 a base64-encoded PHP serialize() JSON of an object dump and told him to extract the URL within.

This is what I imagine the LLM usage of people who tell me AI isn't helpful.

It's like telling me airplanes aren't useful because you can't use them in McDonald's drive-through.

mikestorrent

You should probably tell AI to write you programs to do tasks that programs are better at than minds.

stavros

Don't use LLMs for a task a human can't do; they won't do it well.

wmf

A human could easily come up with a base64 -d | jq oneliner.
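
Something like this in Python (hypothetical blob; pattern-matching for URLs instead of fully parsing, so it works whether the payload is JSON or PHP serialize() output):

    import base64
    import re

    # Hypothetical input: a base64-wrapped payload with a URL buried inside.
    blob = base64.b64encode(b'{"url":"https://example.com/watch"}').decode()

    decoded = base64.b64decode(blob).decode("utf-8", errors="replace")

    # Grep for URLs rather than parsing the structure.
    print(re.findall(r"https?://[^\s\"';]+", decoded))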

gregable

it. Not him.

chinathrow

It's Claude. Where I live, that is a male name.

visioninmyblood

The model is great: it can code up some interesting visual tasks (I guess they have pretty strong tool-calling capabilities), like orchestrating prompt -> image generation -> segmentation -> 3D reconstruction. Check out the results here: https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7. Note that the model was only used to orchestrate the pipeline; the tasks themselves are done by other models in an agentic framework. They must have improved the tool-calling framework with all the MCP usage. Gemini 3 was able to orchestrate the same pipeline, but Claude 4.5 is much faster.
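
(A sketch of the orchestration pattern being described, with entirely made-up tool names: the orchestrating model only emits a plan of tool calls, and a thin harness dispatches each stage to a specialist model.)

    from typing import Callable

    # Stand-in tool implementations; in the real pipeline each of these
    # would call a separate specialist model.
    TOOLS: dict[str, Callable[[dict], dict]] = {
        "generate_image": lambda args: {"image_id": "img_1"},
        "segment":        lambda args: {"mask_id": "mask_1"},
        "reconstruct_3d": lambda args: {"mesh_id": "mesh_1"},
    }

    def run_pipeline(plan: list[dict]) -> dict:
        """Execute the tool-call sequence the orchestrating model emitted."""
        state: dict = {}
        for step in plan:
            result = TOOLS[step["tool"]]({**step.get("args", {}), **state})
            state.update(result)  # later stages see earlier outputs
        return state

    # A plan like the one the model would produce for the example above:
    print(run_pipeline([
        {"tool": "generate_image", "args": {"prompt": "a chair"}},
        {"tool": "segment"},
        {"tool": "reconstruct_3d"},
    ]))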

rishabhaiover

I suspect Cursor is not the right platform to write code on. IMO, humans are lazy and will never hand-write code in Cursor; they default to code generation via prompt, which is sub-optimal.

viraptor

> They default to code generation via prompt, which is sub-optimal.

What do you mean?

rishabhaiover

If you're given a finite context window, what are the most efficient tokens to fill it with for a programming task: sloppy prompts, or actual code (using it with autocomplete)?

Squarex

I have heard that Gemini 3 is not that great in Cursor, but excellent in Antigravity. I don't have time to personally verify all that, though.

config_yml

I've had no success using Antigravity, which is a shame because the ideas are promising but the execution so far is underwhelming. I haven't gotten past an initial planning doc, which is usually aborted due to model provider overload or rate limiting.

itsdrewmiller

My first couple of attempts at Antigravity / Gemini were pretty bad: the model kept aborting, and it was relatively helpless at tools compared to Claude (although, to be fair, I have a lot more experience tuning Claude). Seems like there are some good ideas in Antigravity, but it’s more like an alpha than a product.

koakuma-chan

Nothing is great in Cursor.

incoming1211

I think Gemini 3 is hot garbage in everything. It's great on a greenfield project, trying to one-shot something; if you're working on a long-term project, it just sucks.

unsupp0rted

This is gonna be game-changing for the next 2-4 weeks before they nerf the model.

Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.

Then a sacrificial Anthropic engineer will “discover” a couple of obscure bugs that “in some cases” might have led to less-than-optimal performance. Still largely a user skill issue though.

Then a couple months later they’ll release Opus 4.7 and go through the cycle again.

My allegiance to these companies is now measured in nerf cycles.

I’m a nerf cycle customer.

film42

This is why I migrated my apps that need an LLM to Gemini. No model degradation so far all through the v2.5 model generation. What is Anthropic doing? Swapping for a quantized version of the model?

blurbleblurble

I hope this comment makes it to the top.

Gpt-5.1-* are all fully nerfed for me. Maybe they're giving others the real juice but they're not giving it to me.

My sense is that they give me the real deal until I fork over $200, then they proceed to make me sit there for hours babysitting an LLM that does nothing useful and takes 20 minutes at a time for each nothing increment.

hebejebelus

On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.

A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.

bnchrch

Seeing these benchmarks makes me so happy.

Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.

This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.

I've been holding onto Claude Code for the last little while, since I've built up a robust set of habits, slash commands, and sub-agents that help me squeeze as much out of the platform as possible.

But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.

Thankfully Anthropic has come out swinging today, and my own SOPs can remain intact a little while longer.

bavell

Same boat and same thoughts here! Hope it holds its own against the competition, I've become a bit of a fan of Anthropic and their focus on devs.

tordrt

I tried Codex due to the same reasoning you list. The grass is not greener on the other side. I usually only opt for Codex when my Claude Code rate limit hits.

edf13

I threw a few hours at Codex the other day and was incredibly disappointed with the outcome…

I’m a heavy Claude code user and similar workloads just didn’t work out well for me on Codex.

One of the areas I think is going to make a big difference to any model soon is speed. We can build error-correcting systems into the tools, but the base models need more speed (and, with that, lower costs).

wahnfrieden

You need much less of that robust set of habits, commands, and sub-agent complexity with Codex. Not only does it lack some of these features, it also doesn't need them as much.

futureshock

A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare it to Gemini 3’s score and cost. Gemini 3 Deep Think is still the current leader, but at more than 30x the cost.

The cost curve of achieving these scores is coming down rapidly. In Dec 2024, when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. You can now get the same performance for pennies to dollars, approximately an 80x reduction in 11 months.

https://arcprize.org/leaderboard

https://arcprize.org/blog/oai-o3-pub-breakthrough
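
Sanity-checking that ratio:

    # Dec 2024: o3 (high compute) spent >$3k per ARC-AGI-1 task (see the
    # blog post above). An ~80x reduction in 11 months implies:
    print(3000 / 80)  # ~$37.5/task, i.e. the "dollars" end of pennies-to-dollars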

jasonthorsness

I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.

[1] https://github.com/jasonthorsness/tree-dangler

hu3

Gemini is indeed great for frontend HTML + CSS and even some light DOM manipulation in JS.

I have been using Gemini 2.5 and now 3 for frontend mockups.

When I'm happy with the result, after some prompt massaging, I feed it to Sonnet 4.5 to build full-stack code using the framework of the application.

jumploops

> Pricing is now $5/$25 per million tokens

So it’s 1/3 the price of Opus 4.1…

> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens

…and potentially uses a lot fewer tokens?

Excited to stress test this in Claude Code, looks like a great model on paper!

alach11

This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.

Also, increasingly it's becoming important to look at token usage rather than just token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench Verified, you pay more per token, but you use fewer tokens and overall pay less!
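
Back-of-envelope, taking the thread's numbers at face value ($5/$25 for Opus 4.5; I'm assuming Sonnet 4.5's $3/$15 list price) plus the claimed ~50% output-token reduction, the break-even depends on how output-heavy the task is:

    def cost(in_tok, out_tok, in_price, out_price):
        """Dollar cost at per-million-token prices."""
        return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

    # Hypothetical input-heavy task: big repo context, modest diff out.
    print(cost(200_000, 40_000, 3, 15))    # Sonnet 4.5 -> $1.20
    print(cost(200_000, 20_000, 5, 25))    # Opus 4.5 (half the output) -> $1.50

    # Hypothetical output-heavy task: the output saving dominates.
    print(cost(50_000, 200_000, 3, 15))    # Sonnet 4.5 -> $3.15
    print(cost(50_000, 100_000, 5, 25))    # Opus 4.5 -> $2.75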

jmkni

> Pricing is now $5/$25 per million tokens

For anyone else confused, it's input/output tokens

$5 for 1 million tokens in, $25 for 1 million tokens out

mvdtnz

What prevents these jokers from making their outputs ludicrously verbose to squeeze more out of you, given they charge 5x more for the end that they control? Already model outputs are overly verbose, and I can see this getting worse as they try to squeeze some margin. Especially given that many of the tools conveniently hide most of the output.

stavros

Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.

On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.

EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus pay-as-you-go. It's going to be really nice to use Opus for planning and Sonnet for implementation on top of the Pro subscription.

However, I noticed that the previously available option of "use Opus for planning and Sonnet for implementation" is no longer there in Claude Code with this setup. Hopefully they'll implement it soon, as that would be the best of both worlds.

EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription, I'll have to try it and see.

vunderba

Anecdotally, I kind of compare the quality of Sonnet 4.5 to that of a chess engine: it performs better when given more time to search deeper into the tree of possible moves (more plies). So when Anthropic is under peak load, I think some degradation is to be expected. I just wish Claude Code had a "Signal Peak" indicator so that I could schedule more challenging tasks for a time when it's not under high demand.

pton_xd

I hate jumping in on these model-nerf conspiracy bandwagons but I have to admit, the amount of "wait, that's wrong!" interjections from Sonnet 4.5 recently has been noticeable. And then it still won't get the answer right after backtracking. Very weird.

kjgkjhfkjf

My guess is that Claude's "bad days" are due to the service becoming overloaded and failing over to use cheaper models.

beydogan

100% dumber, especially over the last 3-4 days. I have two guesses:

- They make it dumber close to a new release to hype the new model

- They gave $1000 of Claude Code Web credits to a lot of people, which increased the load a lot, so they had to serve a quantized version to handle it.

I love Claude models, but I hate this lack of transparency and instability.

bryanlarsen

On Friday my Claude was particularly stupid. It's sometimes stupid, but I've never seen it be that consistently stupid. I just assumed it was a fluke, but maybe something was changing.

saaaaaam

Anecdotally, I’ve been using Opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them, and build a report. It’s very, very good at this, and much better than Opus 4.1. I actually didn’t realise I was using Opus 4.5 until I saw this thread.

GenerWork

I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.