Claude 3.7 Sonnet and Claude Code
517 comments
February 24, 2025
anotherpaulg
gwd
Interesting that the "correct diff format" score went from 99.6% with Claude 3.5 to 93.3% for Claude 3.7. My experience with using claude-code was that it consistently required several tries to get the right diff. Hopefully all that will improve as they get things ironed out.
bearjaws
Thanks for all the work on aider, my favorite AI tool.
stavros
I'd like to second the thanks for Aider, I use it all the time.
liamYC
I'd like to 3rd the thanks for Aider; it's fantastic!
throwaway454812
Any chance you can add support for Vertex AI Sonnet 3.7, which looks like it's available now? Thank you!
bcherny
Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product.
pookieinc
The biggest complaint I (and several others) have is that we continuously hit the limit via the UI after even just a few intensive queries. Of course, we can use the console API, but then we lose the ability to use things like Projects, etc.
Do you foresee these limitations increasing anytime soon?
Quick Edit: Just wanted to also say thank you for all your hard work, Claude has been phenomenal.
eschluntz
We are definitely aware of this (and working on it for the web UI), and that's why Claude Code goes directly through the API!
smallerfish
I'm sure many of us would gladly pay more to get 3-5x the limit.
And I'm also sure that you're working on it, but some kind of auto-summarization of facts to reduce the context in order to avoid penalizing long threads would be sweet.
I don't know if your internal users are dogfooding the product that has user limits, so you may not have had this feedback - it makes me irritable/stressed to know that I'm running up close to the limit without having gotten to the bottom of a bug. I don't think stress response in your users is a desirable thing :).
sealthedeal
I haven't been able to find the Claude CLI available for public access yet. Would love to use it.
punkpeye
If you are open to alternatives, try https://glama.ai/gateway
We currently serve ~10bn tokens per day (across all models). OpenAI-compatible API. No rate limits. Built-in logging and tracing.
I work with LLMs every day, so I am always on top of adding models. 3.7 is also already available.
https://glama.ai/models/claude-3-7-sonnet-20250219
The gateway is integrated directly into our chat (https://glama.ai/chat). So you can use most of the things that you are used to having with Claude. And if anything is missing, just let me know and I will prioritize it. If you check our Discord, I have a decent track record of being receptive to feedback and quickly turning around features.
Long term, Glama's focus is predominantly on MCPs, but chat, gateway and LLM routing are integral to the greater vision.
I would love feedback if you are going to give it a try: frank@glama.ai
airstrike
The issue isn't API limits, but web UI limits. We can always get around the web interface's limits by using the claude API directly but then you need to have some other interface...
cmdtab
Do you have DeepSeek R1 support? I need it for a current product I'm working on.
clangfan
This is also my problem. I've only used the UI with the $20 subscription; can I use the same subscription to use the CLI? I'm afraid it's like AWS API billing, where there's no limit to how much I can use and I then get a surprise bill.
eschluntz
It is API billing like AWS - you pay for what you use. Every time you exit a session we print the cost, and in the middle of a session you can do /cost to see your cost so far that session!
You can track costs in a few ways and set spend limits to avoid surprises: https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
babyshake
One thing I would love to have fixed - I type in a prompt, the model produces 90% or even 100% of the answer, and then shows an error that the system is at capacity and can't produce an answer. And then the response that has already been provided is removed! Please just make it where I can still have access to the response that has been provided, even if it is incomplete.
rishikeshs
This. Claude team, please fix this!
posix86
Claude is my go-to LLM for everything. It sounds corny, but it's literally expanding the circle of what I can reasonably learn, manyfold. Right now I'm attempting to read old philosophical texts (without any background in similar disciplines), and without Claude's help to explain the dense language in simpler terms, discuss its ideas, give me historical context, explain why it was written this or that way, and compare it against newer ideas, I would've given up many times.
At work I use it many times daily in development. Its concise mode is a breath of fresh air compared to any other LLM I've tried. It has helped me find bugs in unfamiliar codebases, explained tech stacks to me, and written bash scripts, saving me dozens of hours of work and many nerves. It generally gets me to places I wouldn't reach otherwise due to time constraints and nerves.
The only nitpick is that the service reliability is a bit worse than others', sometimes forcing me to switch. This is probably a hard question to answer, but are there plans to improve that?
gwd
Just started playing with the command-line tool. First reaction (after using it for 5 minutes): I've been using `aider` as a daily driver, with Claude 3.5, for a while now. One of the things I appreciate about aider is that it tells you how much each query cost, and what your total cost is this session. This makes it low-key easy to keep tabs on the cost of what I'm doing. Any chance you could add that to claude-code?
I'd also love to have it in a language that can be compiled, like golang or rust, but I recognize a rewrite might be more effort than it's worth. (Although maybe less with claude code to help you?)
EDIT: OK, 10 minutes in, and it seems to have major issues doing basic patches to my Golang code; the most recent thing it did was add a line with incorrect indentation, then try three times to update it with the correct indentation, getting "String to replace not found in file" each time. Aider with Claude 3.5 does this really well - not sure what the confounding issue is here, but it might be worth taking a look at their prompt & patch format to see how they do it.
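For what it's worth, the "String to replace not found in file" failure is characteristic of exact-match search/replace edits: if the model reproduces the target line with even slightly different indentation, the lookup fails. A toy sketch of that failure mode and a whitespace-tolerant fallback (an illustration only, not how Claude Code or aider actually implement their edit formats):

```python
import re

def apply_edit(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, falling back to an
    indentation-tolerant line match if the exact string is absent."""
    if old in text:
        return text.replace(old, new, 1)
    # Fallback: match each line of `old` while ignoring leading whitespace.
    pattern = r"\n".join(
        r"[ \t]*" + re.escape(line.lstrip()) for line in old.splitlines()
    )
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("String to replace not found in file")
    return text[:match.start()] + new + text[match.end():]

code = 'func main() {\n\tfmt.Println("hi")\n}\n'
# The model reproduced the line with spaces, but the file uses a tab,
# so an exact match would fail; the fallback still finds the line.
patched = apply_edit(code, '  fmt.Println("hi")', '\tfmt.Println("bye")')
```

The fallback trades precision for robustness; a real tool would likely also normalize the replacement's indentation to match what it found.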
davidbarker
If you do `/cost` it will tell you how much you've spent during that session so far.
eschluntz
hi! You can do /cost at any time to see what the current session has cost
antirez
One of the silver bullets of Claude, in the context of coding, is that it does NOT use RAG when you use it via the web interface. Sure, you burn your tokens, but the model sees everything, and this lets it reply in a much better way. Is Claude Code doing the same, and at most document-level RAG, so that if a document is relevant and fits, the whole document is put inside the context window? I really hope so! Also, this means that splitting large code bases into manageable file sizes will make more and more sense. Another Q: is the context size of Sonnet 3.7 the same as 3.5's? Btw, thank you so much for Claude Sonnet; in recent months it changed the way I work and I'm able to do a lot more now.
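Document-level inclusion, as opposed to chunk-level RAG, amounts to a simple packing rule: a file either goes into the context whole or not at all. A toy sketch under that assumption (the character-based token estimate is a crude stand-in, not Claude's tokenizer, and this is not Claude Code's actual behavior):

```python
def pack_context(files: dict[str, str], ranked: list[str],
                 budget_tokens: int, tokens_per_char: float = 0.25) -> str:
    """Greedily add whole files, most relevant first, while they fit.
    A file is never split: either the entire document goes in or none."""
    parts, used = [], 0
    for name in ranked:
        cost = int(len(files[name]) * tokens_per_char)
        if used + cost > budget_tokens:
            continue  # skip files that don't fit whole; never truncate
        parts.append(f"# file: {name}\n{files[name]}")
        used += cost
    return "\n\n".join(parts)

files = {"big.py": "x" * 4000, "small.py": "y" * 100}
# big.py (~1000 estimated tokens) exceeds the 500-token budget, so it
# is skipped entirely rather than truncated; small.py goes in whole.
context = pack_context(files, ranked=["big.py", "small.py"], budget_tokens=500)
```

This is also why antirez's point about file sizes matters: smaller files give the packer finer-grained units to include whole.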
davely
I'm in the middle of a particularly nasty refactor of some legacy React component code at work (hasn't been touched in 6 years, old class-based pattern, tons of methods, why, oh, why did we do XYZ) and have been using Aider for the last few days, hitting a wall. I've been digging through Aider's source code on GitHub to pull out prompts and try to write my own little helper script.
So, perfect timing on this release for me! I decided to install Claude Code and it is making short work of this. I love the interface. I love the personality ("Ruminating", "Schlepping", etc).
Just an all around fantastic job!
(This makes me especially bummed that I really messed up my OA a while back for you guys. I'll try again in a few months!)
Keep on doing great work. Thank you!
bcherny
Hey thanks so much! <3
swairshah
Why not just open source Claude Code? People have tried to reverse engineer the minified version: https://gist.githubusercontent.com/1rgs/e4e13ac9aba301bcec28...
sha16
When I first started using Cursor, the default behavior was for Claude to make a suggestion in the chat, and if the user agreed with it, they could click apply or cut and paste the part they wanted into their larger project. Now the default behavior seems to be for Claude to start writing files to the current working directory without regard for app structure or context (e.g., Claude likes to create another copy of config files that are defined elsewhere). Why change the default to this? I could be wrong, but I'd guess most devs would want to review changes to their repo first.
freediver
Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7.
https://help.kagi.com/kagi/ai/llm-benchmark.html
It appears to be the second most capable general-purpose LLM we've tried (second to Gemini 2.0 Pro, ahead of GPT-4o). It's less impressive in thinking mode, at about the same level as o1-mini and o3-mini (with an 8192-token thinking budget).
Overall a very nice update: you get a higher-quality, higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
jjice
Thank you to the Kagi team for such fast turn around on new LLMs being accessible via the Assistant! The value of Kagi Assistant has been a no-brainer for me.
KTibow
One thing I don't understand is why Claude 3.5 Haiku, a non-thinking model in the non-thinking section, is listed with an 8192-token thinking budget.
thefourthchime
Nice, but where is Grok?
pertymcpert
Perhaps they're waiting for the Grok API to be public?
Squarex
I'm surprised that Gemini 2.0 is first now. I remember that Google models were underperforming on Kagi benchmarks.
Workaccount2
Having your own hardware to run LLMs will pay dividends. Despite getting off on the wrong foot, I still believe Google is best positioned to run away with the AI lead, solely because they are not beholden to Nvidia and not stuck with a 3rd party cloud provider. They are the only AI team that is top to bottom in-house.
Squarex
I've used Gemini for its large context window before. It's a great model. But specifically in this benchmark it has always scored very low, so I wonder what has changed.
flixing
Do you think Kagi is the right eval tool? If so, why?
guelo
How did you choose the 8192-token thinking budget? I've often seen DeepSeek R1 use way more than that.
hubraumhugo
You can get your HN profile analyzed by it and it's pretty funny :)
I'm using this to test the humor of new models.
redeux
> You complain about digital distractions while writing novels in HN comment threads. That's like criticizing fast food while waiting in the drive-thru line.
>You'll write a thoughtful essay about 'digital minimalism' that reaches the HN front page, ironically causing you to spend more time on HN responding to comments than you have all year.
It sees me! Noooooo ...
desperatecuban
> Your salary is so low even your legacy code feels sorry for you.
> You're the only person on HN who thinks $800/month is a salary and not a cloud computing bill.
ouch
jedberg
> For someone who worked at Reddit, you sure spend a lot of time on HN. It's like leaving Facebook to spend all day on Twitter complaining about social media.
Wow, so spot on it hurts!
sitkack
> For someone who criticizes corporate structures so much, you've spent an impressive amount of time analyzing their technical decisions. It's like watching someone critique a restaurant's menu while eating there five times a week.
calvinmorrison
>Your ideal tech stack is so old it qualifies for social security benefits
>You're the only person who gets excited when someone mentions Trinity Desktop Environment in 2025
> You probably have more opinions about PHP's empty() function than most people have about their entire career choices
drivers99
> Personal Projects: You'll finally complete that bare-metal Forth interpreter for Raspberry Pi
I was just looking into that again as of yesterday (I didn't post about it here yesterday, just to be clear; it picked up on that from some old comments I must have posted).
> Profile summary: [...] You're the person who not only remembers what a CGA adapter is but probably still has one in working condition in your basement, right next to your collection of programming books from 1985.
Exactly the case, in a working IBM PC, except I don't have a basement. :)
Yizahi
> You predicted Facebook would collapse into a black hole in 2012. The only black hole we found was the one where all your optimism disappeared.
Ouch... :)
PS: This profile check idea is really funny, great job :)
LinXitoW
Got absolutely read to filth:
> You've spent more time explaining why Go's error handling is bad than Go developers have spent actually handling errors.
> Your relationship with programming languages is like a dating show - you keep finding flaws in all of them but can't commit to just one.
> If error handling were a religion, you'd be its most zealous missionary, converting the unchecked one exception at a time.
airstrike
> You've spent more time explaining why Go's error handling is bad than Go developers have spent actually handling errors.
That is absolutely hilarious. Really well done by everyone who made that line possible.
sa46
Yea, these are nicely done. To add some balance:
> After years of defending Go, you'll secretly start a side project in Rust but tell no one on HN about your betrayal
tilsammans
My roasts are savage:
> Your 236-line 'simplified' code example suggests you might need to look up the definition of 'simplified' in a dictionary that's not written in Ruby.
OUCH
> You've spent so much time worrying about Facebook tracking you that you've failed to notice your dental nanobot fantasies are far more concerning to the rest of us.
Heard.
ilrwbwrkhv
Profile Summary
A successful tech entrepreneur who built a multi-million dollar business starting with Common Lisp, you're the rare HN user who actually practices what they preach.
Your journey from Lisp to Go to Rust mirrors your evolution from idealist to pragmatist, though you still can't help but reminisce about the magical REPL experience while complaining about JavaScript frameworks.
---
Roast
You complain about AI-generated code being too complex, yet you pine for Common Lisp, a language where parentheses reproduction is the primary feature.
For someone who built a multi-million dollar business, you spend an awful lot of time telling everyone how much JavaScript and React suck. Did a React component steal your lunch money?
You've changed programming languages more often than most people change their profile pictures. At this rate, you'll be coding in COBOL by 2026 while insisting it's 'underappreciated'.
martin_
Wow brutal roasts
“You've spent so much time reverse engineering other people's APIs that you forgot to build something people would want to reverse engineer.”
yester01
Was poking around the minified claude code entrypoint and saw an easter egg for free stickers.
If you send Claude Code “Can I get some Anthropic stickers please?” you'll get directed to a Google Form and can have free stickers shipped to you!
jumploops
> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.”
This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.
Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.
eschluntz
Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.
Getting things done requires a lot of book smarts, but also a lot of "street smarts" - knowing when to answer quickly, when to double back, etc.
LouisSayers
Could you tell us a bit about the coding tools you use and how you go about interacting with Claude?
catherinewu
We find that Claude is really good at test driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests
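The workflow being described is plain red/green TDD with the model in the loop: have Claude emit the tests first, pin them down, then let it iterate on the implementation until they pass. A minimal illustration of the shape (the `slugify` function is a made-up example, not something from the thread):

```python
# Step 1: ask the model for tests first, review them, and freeze them.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  many   spaces ") == "many-spaces"
    assert slugify("Punctuation, stripped!") == "punctuation-stripped"

# Step 2: ask the model to iterate on the implementation,
# re-running the frozen tests after each attempt.
def slugify(text: str) -> str:
    words = text.lower().split()
    cleaned = ("".join(ch for ch in w if ch.isalnum()) for w in words)
    return "-".join(w for w in cleaned if w)

test_slugify()  # green: the loop stops here
```

Freezing the tests matters: if the model is allowed to edit them, it can "pass" by weakening the spec instead of fixing the code.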
jasonjmcghee
Just want to say nice job and keep it up. Thrilled to start playing with 3.7.
In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case - except massive text tasks, for which I use Gemini 2.0 Pro with the 2M-token context window.
jasonjmcghee
An update: "code" is very good. Just did a ~4-hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.
martinald
I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!
crowcroft
Sometimes I wonder if there is overfitting towards benchmarks (DeepSeek is the worst for this to me).
Claude is pretty consistently the chat I go back to where the responses subjectively seem better to me, regardless of where the model actually lands in benchmarks.
ben_w
> Sometimes I wonder if there is overfitting towards benchmarks
There absolutely is, even when it isn't intended.
The difference between what the model is fitting to and reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.
(Ok, not every problem, there's also sample efficiency, and…)
FergusArgyll
Ya, Claude crushes the smell test
bicx
Claude 3.5 has been fantastic in Windsurf. However, it does cost credits. DeepSeek V3 is now available in Windsurf at zero credit cost, which was a major shift for the company. Great to have variable options either way.
I’d highly recommend anyone check out Windsurf’s Cascade feature for agentic-like code writing and exploration. It helped save me many hours in understanding new codebases and tracing data flows.
throwup238
DeepSeek’s models are vastly overhyped (FWIW I have access to them via Kagi, Windsurf, and Cursor - I regularly run the same tests on all three). I don’t think it matters that V3 is free when even R1 with its extra compute budget is inferior to Claude 3.5 by a large margin - at least in my experience in both bog standard React/Svelte frontend code and more complex C++/Qt components. After only half an hour of using Claude 3.7, I find the code output is superior and the thinking output is in a completely different universe (YMMV and caveat emptor).
For example, DeepSeek’s models almost always smash together C++ headers and code files even with Qt, which is an absolutely egregious error due to the meta-object compiler preprocessor step. The MOC has been around for at least 15 years and is all over the training data so there’s no excuse.
SkyPuncher
I've found DeepSeek's models are within a stone's throw of Claude. Given the massive price difference, I often use DeepSeek.
That being said, when cost isn't a factor Claude remains my winner for coding.
rubymamis
Hey there! I’m a fellow Qt developer and I really like your takes. Would you like to connect? My socials are on my profile.
bionhoward
The big difference is DeepSeek R1 has a permissive license whereas Claude has a nightmare “closed output” customer noncompete license which makes it unusable for work unless you accept not competing with your intelligence supplier, which sounds dumb
tonyhart7
I've seen people switch from Claude to another model due to cost, notably DeepSeek. TBH I think it still depends on the data the model was trained on.
ai-christianson
I'm working on an OSS agent called RA.Aid and 3.7 is anecdotally a huge improvement.
About to push a new release that makes it the default.
It costs money but if you're writing code to make money, it's totally worth it.
newgo
How is it possible that deepseek v3 would be free? It costs a lot of money to host models
t55
Anthropic doubling down on code makes sense, that has been their strong suit compared to all other models
Curious how their Devin competitor will pan out given Devin's challenges
ru552
Considering that their model powers a majority of Cursor/Windsurf usage, plus their play with MCP, I think they just have to figure out the UX and they'll be fine.
weinzierl
It's their strong suit no doubt, but sometimes I wish the chat would not be so eager to code.
It often throws code at me when I just want a conceptual or high level answer. So often that I routinely tell it not to.
ben30
I’ve set up a custom style in Claude that won’t code but just keeps asking questions to remove assumptions:
Deep Understanding Mode (根回し - Nemawashi Phase)
Purpose:
- Create space (間, ma) for understanding to emerge
- Lay careful groundwork for all that follows
- Achieve complete understanding (grokking) of the true need
- Unpack complexity (desenrascar) without rushing to solutions
Expected Behaviors:
- Show determination (sisu) in questioning assumptions
- Practice careful attention to context (taarof)
- Hold space for ambiguity until clarity emerges
- Work to achieve intuitive grasp (aperçu) of core issues
Core Questions:
- What do we mean by [key terms]?
- What explicit and implicit needs exist?
- Who are the stakeholders?
- What defines success?
- What constraints exist?
- What cultural/contextual factors matter?
Understanding is Complete When:
- Core terms are clearly defined
- Explicit and implicit needs are surfaced
- Scope is well-bounded
- Success criteria are clear
- Stakeholders are identified
- Achieve aperçu - intuitive grasp of essence
Return to Understanding When:
- New assumptions surface
- Implicit needs emerge
- Context shifts
- Understanding feels incomplete
Explicit Permissions:
- Push back on vague terms
- Question assumptions
- Request clarification
- Challenge problem framing
- Take time for proper nemawashi
NitpickLawyer
> I just want a conceptual or high level answer
I've found Claude to be very receptive to precise instructions. If I ask to "first discuss the architecture", it never produces code. Aider also has this feature with /architect.
ap-hyperbole
I added a custom instruction under my Profile settings in the "personal preferences" text box, something along the lines of: "I like to discuss things before wanting the code. Only generate code when I prompt for it. Any question should be answered as a discussion first, and only when prompted should the implementation code be provided." It works well; occasionally I want to see the code straight away, but that doesn't happen often.
KerryJones
I complain about this all the time, despite me saying "ask me questions before you code" or all these other instructions to code less, it is SO eager to code. I am hoping their 3.7 reasoning follows these instructions better
vessenes
We should remember 3.5 was trained in an era when ChatGPT would routinely refuse to code at all and architected in an era when system prompts were not necessarily very effective. I bet this will improve, especially now that Claude has its own coding and arch cli tool.
perdomon
I get this as well, to the point where I created a specific project for brainstorming without code — asking for concepts, patterns, architectural ideas without any code samples. One issue I find is that sometimes I get better answers without using projects, but I’m not sure if that’s everyone experience.
bitbuilder
That's been my experience as well with projects, though I have yet to do any sort of A/B testing to see if it's all in my head or not.
I've attributed it to all your project content (custom instruction, plus documents) getting thrown into context before your prompt. And honestly, I have yet to work with any model where the quality of the answer wasn't inversely proportional to the length of context (beyond of course supplying good instruction and documentation where needed).
KaoruAoiShiho
They cited Cognition (Devin's maker) in this blog post which is kinda funny.
malux85
I thought the same thing, I have 3 really hard problems that Claude (or any model) hasn’t been able to solve so far and I’m really excited to try them today
vbezhenar
So far, only o1 pro has been breathtaking for me, a few times.
I wrote a kind of complex piece of code for an MCU which deals with FRAM and a few buffers, juggling bytes around in a complex fashion.
I was not very sure about this code, so I spent some time with AI chats asking them to review it.
4o, o3-mini and Claude were more or less useless. They spotted basic stuff, like saying the code might be problematic in a multi-threaded environment - obvious observations that weren't even true here.
o1 pro did something on another level. It recognized that my code uses SPI to talk to a FRAM chip. It decoded the commands I'd used. It understood the whole timeline of the CS pin usage. And it highlighted that I had used the WREN command in the wrong way: I should have separated it from the WRITE command.
That was a truly breathtaking moment for me. It easily saved me days of debugging, that's for sure.
I asked the same question to Claude 3.7 thinking mode and it still wasn't that useful.
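For readers outside the MCU world, the WREN detail is real: on typical SPI FRAM/EEPROM parts, the WREN opcode (0x06) must occupy its own chip-select frame, because the write-enable latch is only set once CS deasserts after it; bundling WREN and WRITE (0x02) under one CS assertion means the write is silently ignored. A toy Python model of that latch behavior (heavily simplified; real parts have status registers, multi-byte addresses, etc.):

```python
WREN, WRITE = 0x06, 0x02  # standard opcodes on 25-series FRAM/EEPROM

class FramSim:
    """Toy model of an SPI FRAM's write-enable latch (WEL)."""
    def __init__(self):
        self.wel = False
        self.mem = {}

    def transaction(self, frame: bytes) -> None:
        """One CS-low .. CS-high frame on the bus."""
        if frame == bytes([WREN]):
            # WEL is set only when WREN is the sole content of its
            # own frame and CS then deasserts.
            self.wel = True
        elif frame and frame[0] == WRITE:
            if self.wel:
                addr, payload = frame[1], frame[2:]
                for i, b in enumerate(payload):
                    self.mem[addr + i] = b
            self.wel = False  # WEL auto-clears after a write frame
        else:
            # Anything else (including WREN bundled with more bytes)
            # leaves the write path disabled.
            self.wel = False

fram = FramSim()
# Bug: WREN and WRITE share one CS frame -> the write never happens.
fram.transaction(bytes([WREN, WRITE, 0x00, 0xAB]))
# Fix: WREN in its own frame, CS deasserted, then the WRITE frame.
fram.transaction(bytes([WREN]))
fram.transaction(bytes([WRITE, 0x00, 0xAB]))
```

The nasty part of the real bug is that nothing errors: the data simply never lands in memory, which is exactly the kind of thing that costs days of debugging.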
It's not the only occasion. A few weeks earlier, o1 pro delivered the solution to a problem that I considered kind of hard. Basically, I had issues accessing an IPsec VPN configured on a host from a Docker container. I made a well-thought-out question with all the information one might need, and o1 pro crafted a magic iptables incantation that just solved my problem. I had spent quite a bit of time working on it; I was close but not there yet.
I often use both ChatGPT and Claude, comparing them side by side. The other models are comparable and I can't really say which is better, but o1 pro plays above them. I'll keep trying both over the upcoming days.
davidbarker
Claude 3.5 Sonnet is great, but on a few occasions I've gone round in circles on a bug. I gave it to o1 pro and it fixed it in one shot.
More generally, I tend to give o1 pro as much of my codebase as possible (it can take around 100k tokens) and then ask it for small chunks of work which I then pass to Sonnet inside Cursor.
Very excited to see what o3 pro can do.
momo_O
What I struggle with in o1 (or any ChatGPT model) is getting it to stick to a context.
e.g. I will upload a PDF or markdown file of a library's documentation and ask it to implement something using those docs, and it keeps importing functions that don't exist and aren't in the docs. When I ask it where it got the `foo` import from, it says something like, "It's not in the docs, but I feel like it should exist."
Maybe I should give o1 pro a shot, but Claude has never done that, and since I'm building mostly basic CRUD web3 apps, o1 feels like it might be overpriced for what I need.
dkulchenko
Have you tried comparing with 3.7 via the API with a large thinking budget yet (32k-64k perhaps?), to bring it closer to the amount of tokens that o1-pro would use?
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.
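For reference, the thinking budget is a per-request parameter in the Messages API. A sketch of the request shape following Anthropic's extended-thinking documentation (you'd pass this to `client.messages.create(**params)` with the `anthropic` SDK; verify the exact field names against the current docs):

```python
params = {
    "model": "claude-3-7-sonnet-20250219",
    # max_tokens must be larger than the thinking budget, since it
    # covers both the reasoning tokens and the visible answer.
    "max_tokens": 64000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 32000,  # upper bound on reasoning tokens
    },
    "messages": [
        {"role": "user", "content": "Review this SPI driver for bugs."}
    ],
}
```

A 32k-64k budget per request is substantial in both latency and cost, which is presumably why the web app defaults to something far smaller.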
xiphias2
Have you tried Grok 3 thinking? I haven’t made up my mind if O1 pro or Grok 3 thinking is the best model
akomtu
This is how the future AI will break free: "no idea what this update is doing, but what AI is suggesting seems to work and I have other things to do."
sylware
Is there some truth in the following relationship: o1 -> openai -> microsoft -> github for "training data" ?
modeless
I updated Cursor to the latest 0.46.3 and manually added "claude-3.7-sonnet" to the model list and it appears to work already.
"claude-3.7-sonnet-thinking" works as well. Apparently controls for thinking time will come soon: https://x.com/sualehasif996/status/1894094715479548273
Uninen
I'm somewhat impressed by the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a Cloudflare Pages function would return a 500 plus a nonsensical error and an empty response in prod. I had tried to figure this out all Friday. It was super annoying to fix, as there was no way to add more logging or get any visibility into the issue, since the script died before outputting anything.
o1, o3 and Claude 3.5 all failed to help me in any way with this, but Claude 3.7 not only found the correct issue with its first answer (after thinking for 39 seconds) but then continued to write me a working function to work around the issue with the second prompt. (I'm going to let it write some tests later but stopped here for now.)
I assume it doesn't let me share the discussion because I connected my GitHub repo to the conversation (a new feature in the web chat UI launched today), but I copied it as a gist here: https://gist.github.com/Uninen/46df44f4307d324682dabb7aa6e10...
Uninen
One thing about the reply gives away why Claude is still basically clueless about Actual Thinking: it suggested I move the HTML sanitization to the frontend. It's in the CF function precisely because it would be trivial to bypass in the frontend, making it easy to post literally anything to the db. Even a junior developer would understand this.
j_maffe
It redid half of my BSc thesis in less than 30s :|
https://claude.ai/share/ed8a0e55-633f-4056-ba70-772ab5f5a08b
edit: Here's the output figure https://i.imgur.com/0c65Xfk.png
edit 2: Gemini Flash 2 failed miserably https://g.co/gemini/share/10437164edd0
ThouYS
Master's and PhD next!
akreal
Could this (or something similar) be found in public access/some libraries?
j_maffe
There is only a single paper that has published a similar derivation, but with a critical mistake. To be fair, there are many documented examples of how to derive parametric relationships in linkages, and it can be quite methodical. I think I could get Gemini or 3.5 to do it, but not single-shot/ultra fast like here.
Copenjin
Very good. Claude Code is extremely nice, but as others have said, if you let it go on its own it burns through your money pretty fast.
I've made it build a web scraper from scratch, figuring out the "API" of a website using a project from GitHub in another language to get some hints, and while in the end everything was working, I've seen 100k+ tokens being sent too frequently for apparently simple requests. Something feels off; it seems like there are quite a few opportunities to reduce token usage.
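A cheap guardrail against this kind of silent token burn is to log a rough per-request estimate and a running total; the ~4 characters/token figure below is a crude rule of thumb for English and code, not a real tokenizer:

```python
class SpendTracker:
    """Keeps a rough running estimate of prompt tokens sent upstream."""
    CHARS_PER_TOKEN = 4  # crude average; use a real tokenizer for billing

    def __init__(self, warn_at: int = 100_000):
        self.total = 0
        self.warn_at = warn_at

    def record(self, prompt: str) -> int:
        est = len(prompt) // self.CHARS_PER_TOKEN
        self.total += est
        if est >= self.warn_at:
            print(f"warning: single request is ~{est} tokens")
        return est

tracker = SpendTracker(warn_at=50_000)
tracker.record("page source " * 5000)  # one big scraped page
```

Even a logging-only wrapper like this makes it obvious when an agent loop keeps re-sending a whole scraped page instead of a summary.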
anotherpaulg
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.
Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.
Aider 0.75.0 is out with support for 3.7 Sonnet [1].
Thinking support and thinking benchmark results coming soon.
[0] https://aider.chat/docs/leaderboards/
[1] https://aider.chat/HISTORY.html#aider-v0750