Claude Opus 4.1
191 comments
· August 5, 2025
ryandrake
AlecSchueler
I'm not sure what's complicated about what you're describing? They offer two models and you can pay more for higher usage limits, then you can choose if you want to run it in your browser or in your terminal. Like what else would you expect?
Fwiw I have a Claude Pro plan and no interest in using the other offerings, so for me it already is super simple (one model, one interface, one pricing plan).
onlyrealcuzzo
When people post this stuff, it's like, are you also confused that Nike sells shoes AND shorts AND shirts, and there's different colors and skus for each article of clothing, and sometimes they sell direct to consumer and other times to stores and to universities, and also there's sales and promotions, etc, etc?
It's almost as if companies sell more than one product.
Why is this the top comment on so many threads about tech products?
furyofantares
In this case, they tried something and were told they were doing it wrong, and they know there's more than one way to do it wrong - wrong model, wrong tool using the model, wrong prompting, wrong task that you're trying to use it for.
And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.
On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.
kelnos
Because the offerings are not simple. Your Nike example is silly; everyone knows what to do with shoes and shorts and shirts, and why they might want (or not want) to buy those particular items from Nike.
But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model over another. It's hard to understand why you might want per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.
I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.
tomrod
> Why is this the top comment on so many threads about tech products?
Because you overestimate how much the representative person understands.
A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and the green-blue shoes add 20 mph to your 100-yard dash.
You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.
squeaky-clean
Which Nike shoe is best for basketball? The Nike Dunk, Air Force 1, Air Jordan, LeBron 20, LeBron XXI Prime 93, Kobe IX elite, Giannis Freak 7, GT Cut, GT Cut 3, GT Cut 3 Turbo, GT Hustle 3, or the KD18?
At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?
gmueckl
When you walk into a store, you can see and touch all of these products. It's intuitive.
With all this LLM cruft, all you get is essentially the same old chat interface; it's like the year 2000 called and wants its online chat websites back. The only thing you usually get besides a text box is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic-sounding options (GPT-something, Claude Whatever...). Of course this confuses people!
ryandrake
Hey, I'm open to the idea that I'm just stupid. But, if people in your target market (software developers) don't even understand your product line and need a HOWTO+glossary to figure it out, maybe there's also a branding/messaging/onboarding problem?
pdntspa
Because few seem to want to expend the effort to dive in and understand something. Instead they want the details spoonfed to them by marketing or something.
I absolutely loathe this timeline we're stuck in.
margalabargala
If anything, Anthropic has the product lineup that makes the most sense. Higher version numbers mean a better model, and the names scale with the length of the poem form: Haiku < Sonnet < Opus. Likewise Free < Pro < Max.
Contrast that with OpenAI. They've got GPT-4.1, 4o, and o4. Which of these is newer than the others? How do people remember which of o4 and 4o is which?
windsignaling
On the contrary, I'm confused about why you're confused.
This is a well-known and documented phenomenon - the paradox of choice.
I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.
I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.
Filligree
You need Claude Pro or Max. The website subscription also allows you to use the command line tool—the rate limits are shared—and the command line tool includes IDE integration, at least for VSCode.
Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.
wahnfrieden
Correct. Claude Code Max with Opus. Don’t even bother with Sonnet.
andsoitis
> use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.
Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart
robluxus
> I just want to putz around with something in VSCode for a few hours!
I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.
Why care about pricing and product names and UI until it's a problem?
> Someone on HN told me Copilot sucks, use Claude.
I concur, but I'm also just a dude saying some stuff on HN :)
kelnos
If you're looking for a coding assistant, get Claude Code, and give it a try. I think you need the Pro plan at a minimum for that ($20/mo; I don't think Free includes Claude Code). Don't do the per-request API pricing as it can get expensive even while just playing around.
Agree that the offering is a bit confusing and it's hard to know where to start.
Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.
prinny_
What exactly did you try with GitHub Copilot? It's not an LLM itself, just an interface to an LLM. I have Copilot in my professional GitHub account and I can choose between ChatGPT and Claude.
vlade11115
Claude Code has two usage modes: pay-per-token or subscription. Both use the API under the hood, but with a subscription you pay a fixed amount per month. Each subscription tier has undisclosed limits; cheaper plans have lower usage limits. So I would recommend paying $20 and trying Claude Code on that subscription.
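To make that trade-off concrete, here is a back-of-the-envelope sketch. The per-million-token rates are assumptions based on Anthropic's published Sonnet 4 pricing at the time of writing ($3 input / $15 output); check the current pricing page before relying on them:

```python
# Hypothetical rates (USD per million tokens) -- verify against the
# current Anthropic pricing page before relying on these numbers.
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Pay-per-token cost of one exchange, in dollars."""
    return (input_tokens * SONNET_INPUT_PER_M
            + output_tokens * SONNET_OUTPUT_PER_M) / 1_000_000

# An agentic coding session can easily burn ~200k input / 50k output tokens.
session = api_cost(200_000, 50_000)
monthly = session * 22  # one session per working day
print(f"~${session:.2f}/session, ~${monthly:.2f}/month at API rates")
```

At those assumed rates, one modest session a day already exceeds the $20/mo subscription, which is why the flat plan is the cheaper way to experiment.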
kace91
I’m looking for Cursor alternatives after the confusing pricing changes. Is Claude Code an option? Can it be integrated into an editor/IDE for similar results?
My use case so far is usually requesting mechanical work I would rather describe than write myself, like certain test suites, and sometimes discovery on messy codebases.
joshmarlow
VSCode has a pretty good Gemini integration - it can pull up a chat window from the side. I like to discuss design changes and small refactorings ("I added this new rpc call in my protobuf file, can you go ahead and stub out the parts of code I need to get this working in these 5 different places?") and it usually does a pretty darn good job of looking at surrounding idioms in each place and doing what I want. But gemini can be kind of slow here.
But I would recommend just starting with Claude in the browser: talk through an idea for a project you have and ask it to build it for you. Have a brainstorming session before you actually ask it to code; it helps make sure the model has all of the context. Don't be afraid to overload it with requirements, as it's generally pretty good at putting together a coherent plan. If the project is small and fits in a single file, say a one-page web app or a complicated data schema plus SQL queries, then it can usually do a pretty good job in one shot. Then just copy and paste the code and run it outside the browser.
This workflow works well for exploring and understanding new topics and technologies.
Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.
Hope this helps!
qsort
All three major labs released something within hours of each other. This anime arc is insane.
Etheryte
This is why you have PR departments. Being on top of the HN front page, news sites, etc matters a lot. Even if you can't be the first, it's important to dilute the attention as much as possible to reduce the limelight your competitors get.
x187463
Given the GPT5 rumors, August is just getting started.
kridsdale3
Given the Gregorian Calendar and the planet's path through its orbit, August is just getting started.
tomrod
This legitimately made me chuckle.
ozgung
What a time to be alive
tonyhart7
It's as if they wait for a competitor to launch first, then release at the same time to let the market decide which one is best.
torginus
I think this means that GPT5 is better - you can't launch a worse model after the competitor supersedes you - you have to show that you're in the lead even if its just for a day.
candiddevmike
It's definitely a coincidence
wilg
It's not a coincidence or a cartel, it's PR counterprogramming.
vFunct
None of them seem to have published any papers associated with them on how these new models advanced the state-of-the-art though. =^(
jzig
I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certain things while using Sonnet for others?
SkyPuncher
I don't doubt Opus is technically superior, but it's not practically superior for me.
It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.
For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.
bdamm
I've been having a great time with Windsurf's "Planning" feature. Have a nice discussion with Cascade (Claude) all about what it is that needs to happen - sometimes a very long conversation including test code. Then, when everything is very clear, make it happen. Then test and debug the results with all that context. Pretty nice.
ssk42
You can also always have it create design docs and Mermaid diagrams for each task. That makes it much easier to outline the "why" earlier on: shifting left.
adastra22
Every time that Sonnet is acting like it has brain damage (which is once or twice a day), I switch to Opus and it seems to sort things out pretty fast. This is unscientific anecdata though, and it could just be that switching models (any model) would have worked.
gpm
This seems like a case of reversion to the mean. When one model is performing below average, changing anything (like switching to another model) is likely to improve it by random chance...
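That effect is easy to demonstrate with a tiny simulation (illustrative only; the i.i.d. Gaussian noise model is an assumption, not a claim about how LLM output quality actually varies):

```python
import random

random.seed(42)

# Model per-task "quality" as i.i.d. noise around a fixed skill level.
# Condition on a bad draw (the moment you'd rage-switch models) and see
# how often the very next independent draw looks better anyway.
draws = [random.gauss(0, 1) for _ in range(100_000)]

bad = improved = 0
for prev, nxt in zip(draws, draws[1:]):
    if prev < -1.0:          # an unusually bad result
        bad += 1
        if nxt > prev:       # the next attempt looks better
            improved += 1

print(f"After a bad draw, the next draw was better "
      f"{improved / bad:.0%} of the time")
```

Under these assumptions, switching to a different model right after a bad answer will look like it "fixed" things the vast majority of the time, even when the switch itself did nothing.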
keeeba
Anthropic says Opus is better, benchmarks and evals say Opus is better, and Opus has more parameters; parameters determine how much an NN can learn.
Maybe Opus just is better
monatron
This is a great use case for sub-agents IMO. By default, sub-agents use sonnet. You can have opus orchestrate the various agents and get (close to) the best of both worlds.
adastra22
Is there a way to get persistent sub-agents? I'd love to have a bunch of YAML files in my repository, one for each sub-agent, and have those automatically used across all Claude Code instances I have on multiple machines (I dev on laptop and desktop), or across the team.
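Claude Code does support file-based sub-agent definitions that can live in the repo. As a sketch (the frontmatter fields here reflect Anthropic's docs at the time of writing; verify the exact names against the current documentation), a project-level agent is a Markdown file with YAML frontmatter under `.claude/agents/`:

```markdown
---
name: test-runner
description: Runs the test suite and triages failures. Use proactively after code changes.
tools: Bash, Read, Grep
---

You are a test automation specialist. When invoked, run the project's test
suite, summarize any failures, and propose minimal fixes that preserve the
intent of the tests.
```

Because these are plain files under version control, committing `.claude/agents/` shares them across machines and teammates; user-level agents in `~/.claude/agents/` apply to every project on a single machine.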
HarHarVeryFunny
Maybe context rot? If model's output seems to be getting worse or in a rut, then try just clearing context / starting a new session.
adastra22
Switching models with the same context, in this case.
j45
They both seem to behave differently depending on how loaded the system seems to be.
api
I have suspected for a long time that hosted models load shed by diverting some requests to lesser models or running more quantized versions under high load.
anonzzzies
Exactly that.
Uehreka
> yet the general consensus and my own experience seem to be that Sonnet is much much better
Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.
I was trying to dig into this yesterday, but every time I come across a new thread, the things people are saying, and the proportions saying them, are different.
I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.
MostlyStable
Opus seems better to me on long tasks that require iterative problem solving and keeping track of the context of what we have already tried. I usually switch to it for any kind of complicated troubleshooting etc.
I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.
unshavedyak
Same. I'm on the $200 plan and I find Opus "better", but Sonnet is more straight forward. Sonnet is, to me, a "don't let it think" model. It does great if you give it concrete and small goals. Anything vague or broad and it starts thinking and it's a problem.
Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not good enough to me. But it can be good enough to convince you that it can do the job.. so i dunno, i almost dislike it in this regard. I find Sonnet just easier to predict in this regard.
Could i use Opus like i do Sonnet? Yes definitely, and generally i do. But then i don't really see much difference since i'm hand-holding so much.
jm4
I use both. Sonnet is faster and more cost efficient. It's great for coding. Where Opus is noticeably better is in analysis. It surpasses Sonnet for debugging, finding patterns in data, creativity and analysis in general. It doesn't make a lot of sense to use Opus exclusively unless you're on a max20 plan and not hitting limits. Using Opus for design and troubleshooting and Sonnet for everything else is a good way to go.
biinjo
I'm on the Max plan and generally Opus seems to do better work than Sonnet. However, that's only when they allow me to use Opus. The usage limits, even on the Max plan, are a joke. Yesterday I hit the limits within MINUTES of starting my work day.
furyofantares
I'm a bit confused by people hitting usage limits so quickly.
I use Opus exclusively and don't hit limits. ccusage reports I'm using the API-equivalent of $2000/mo.
Bolwin
That's insane. Are you accounting for caching? If not, there's no way this is going to last
rirze
You always have to ask which plan they're paying for. Sometimes people complain about the $20 per month plan...
dsrtslnd23
Same here; I constantly hit the Opus limits after minutes on the Max plan.
epolanski
Yeah, you need to actively cherry-pick which model to use in order to not waste tokens on stuff that would be easily handled by a simpler model.
dested
If I'm using cursor then sonnet is better, but in claude code Opus 4 is at least 3x better than Sonnet. As with most things these days, I think a lot of it comes down to prompting.
jzig
This is interesting. I do use Cursor with almost exclusively Sonnet and thinking mode turned on. I wonder if what Cursor does under the hood (like their indexing) somehow empowers Sonnet more. I do not have much experience with using Claude Code.
astrostl
With aggressive Claude Code use I didn't find Sonnet better than Opus but I did find it faster while consuming far fewer tokens. Once I switched to the $100 Max plan and configured CC to exclusively use Sonnet I haven't run into a plan token limit even once. When I saw this announcement my first thing was to CMD-F and see when Sonnet 4.1 was coming out, because I don't really care about Opus outside of interactive deep research usage.
djha-skin
Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is that Sonnet 3.5 costs the same[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, spinning in circles. OpenAI is better, but still falls short of Claude's performance. Claude also returns 400s from its API if you Ctrl-C in the middle, though, so that's annoying.
Economics is important. Best bang for the buck seems to be OpenAI's GPT-4.1 mini[6]. It does a decent job, doesn't flood my context window with useless tokens like Claude does, and the API works every time. It gets me out of bad spots. It can get confused, but I've been able to muddle through with it.
1: https://openrouter.ai/anthropic/claude-opus-4.1
2: https://openrouter.ai/anthropic/claude-sonnet-4
3: https://block.github.io/goose/
4: https://openrouter.ai/anthropic/claude-3.5-sonnet
generalizations
Get a subscription and use claude code - that's how you get actual reasonable economics out of it. I use claude code all day on the max subscription and maybe twice in the last two weeks have I actually hit usage limits.
tgtweak
Is it considerably more cost effective than cline+sonnet api calls with caching and diff edits?
Same context length and throughput limits?
Anecdotally I find gpt4.1 (and mini) were pretty good at those agentic programming tasks but the lack of token caching made the costs blow up with long context.
bavell
I'm on the basic $20/mo sub and only ran into token cap limitations in the first few days of using Claude Code (now 2-3 weeks in) before I started being more aggressive about clearing the context. Long contexts will eat up tokens caps quickly when you are having extended back-and-forth conversations with the model. Otherwise, it's been effectively "unlimited" for my own use.
kroaton
GLM 4.5 / Kimi K2 / Qwen Coder 3 / Gemini Pro 2.5
gusmally
They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon
(He had been stuck in the Team Rocket hideout (I believe) for weeks)
taormina
Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. I wish they gave you some way to tell.
At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slop all day.
I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
AlecSchueler
I've also noticed Sonnet starting to degrade. It's developing some of the behaviours that put me off the competition in the first place. Needless explanations, filler in responses, wanting to put everything in lists, even increased sycophancy.
bavell
> I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
Welcome to the machine
thoop
The article says "We plan to release substantially larger improvements to our models in the coming weeks."
Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.
I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.
steveklabnik
This is the bit I'm most interested in:
> We plan to release substantially larger improvements to our models in the coming weeks.
machiaweliczny
This is so people don't immediately migrate to GPT5.
haaz
It is barely an improvement according to their own benchmarks. Not saying that's a bad thing, but it's not enough for anybody to notice any difference.
waynenilsen
I think it's probably mostly vibes, but that still counts. This isn't in the charts:
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
Topfi
I am still very early, but output quality wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better in following those needles even if not explicitly guided to compared to Opus 4.
onlyrealcuzzo
I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.
ttoinou
That's why they named it 4.1 and not 4.5
zamadatix
When it's "that's why they incremented the version by a tenth instead of a half" you know things have really started to slow for the large models.
phonon
Opus 4 came out 10 weeks ago. So this is basically one new training run improvement.
mclau157
They released this because competitors are releasing things
gloosx
They need to leave some room to release 10 more models. They could crank benchmarks to 100%, but then no new model would be needed, lol. Pretty sure these pretty benchmark graphs are mostly staged marketing numbers, since the models are evaluated on the same kinds of problems they were trained on; no novel or unknown problem is presented to them.
levocardia
"You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!
leetharris
Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.
AstroBen
I don't think this could even be called an improvement? It's small enough that it could just be random chance
j_bum
I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.
Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
minimaxir
This likely won't move the needle for Opus use over Sonnet while the cost remains the same. Using OpenRouter rankings (https://openrouter.ai/rankings) as a proxy, Sonnet 3.7 and Sonnet 4 combined generates 17x more tokens than Opus 4.
P24L
The improved Opus isn’t about achieving significantly better peak performance for me. It’s not about pushing the high end of the spectrum. Instead, it’s about consistently delivering better average results - structuring outputs more effectively, self-correcting mistakes more reliably, and becoming a trustworthy workhorse for everyday tasks.
NitpickLawyer
Cheekily announcing during oAI's oss model launch :D
ryandrake
Am I the only one super confused about how to even get started trying out this stuff? Just so I wouldn't be "that critic who doesn't try the stuff he criticizes," I tried GitHub Copilot and was kind of not very impressed. Someone on HN told me Copilot sucks, use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.
Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?), and one is just browser based, I guess. Then you have the different pricing plans: Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the website, but Claude Code is per-request? Or is it the Claude API that's per-request? I have no idea. Then you have the models, Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!