Claude 3.7 Sonnet and Claude Code

1013 comments

February 24, 2025

anotherpaulg

Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

nightpool

> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

anotherpaulg

I try not to let perfect be the enemy of good. All benchmarks have limitations.

The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code. I receive a lot of feedback that the aider benchmarks correlate strongly with people's "vibes" on model coding skill. I agree. The scores have felt quite aligned with my hands-on experience coding with most of the top models over the last 18+ months.

To be clear, the purpose of the benchmark is to help me quantitatively assess and improve aider and make it more effective. But it's also turned out to be a great way to measure the coding skill of LLMs.

guccihat

> The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code

The Aider Polyglot website also states that the benchmark "...asks the LLM to edit source files to complete 225 coding exercises".

However, when looking at the actual tests [0], it doesn't seem to be about editing code bases; it's rather just solving simple programming exercises. What am I missing?

[0] https://github.com/Aider-AI/polyglot-benchmark

jrflowers

>I try not to let perfect be the enemy of good. All benchmarks have limitations.

Overfitting is one of the fundamental issues to contend with when trying to figure out if any type of model at all is useful. If your leaderboard corresponds to vibes and that is your target, you could just have a vibes leaderboard

rodrigodlu

That's my perception as well. Most of the time, most of the devs I know, including myself, are not really creating novelty with the code itself, but with the product. (Sometimes even the product is not novel, just a similar enhanced version of existing products)

If the resulting code is not trying to be excessively clever or creative this is actually a good thing in my book.

The novelty and creativity should come from the product itself, especially from the users'/customers' perspective. Some people are too attached to LLM leaderboards being about novelty. I want reliable results whenever I give instructions, whether for the code or for the specs built into a spec file after throwing some ideas into prompts.

Marazan

Having the verbatim answer to the test is not a "limitation" it is an invalidation.

szundi

[dead]

jonplackett

I like to make up my own tests, that way you know it is actually thinking.

Tests that require thinking about the physical world are the most revealing.

My new favourite is:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can.

You have two options: 1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add cold milk.

Which one cools the coffee to the lowest temperature and why?

Phrased this way without any help, all but the thinking models get it wrong
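
For reference, the physics is easy to check with Newton's law of cooling: temperature decays exponentially toward ambient, and mixing is a volume-weighted average. Here's a minimal sketch; all the constants (temperatures, milk fraction, cooling coefficient) are illustrative assumptions, but the ordering of the two options only depends on the milk being colder than the room:

    # Newton's law of cooling: temperature decays exponentially toward ambient.
    # Mixing two liquids is modeled as a volume-weighted average.
    import math

    T_AMBIENT = 20.0   # room temperature, C (assumed)
    T_COFFEE = 90.0    # starting coffee temperature, C (assumed)
    T_MILK = 5.0       # cold milk temperature, C (assumed)
    MILK_FRAC = 0.2    # milk as a fraction of the final volume (assumed)
    K = 0.005          # cooling constant, 1/s (assumed)
    SECONDS = 120

    def cool(temp, seconds):
        return T_AMBIENT + (temp - T_AMBIENT) * math.exp(-K * seconds)

    def mix(coffee_temp):
        return (1 - MILK_FRAC) * coffee_temp + MILK_FRAC * T_MILK

    print(cool(mix(T_COFFEE), SECONDS))  # option 1, milk first: ~49.1 C
    print(mix(cool(T_COFFEE, SECONDS)))  # option 2, milk last:  ~47.7 C

Option 2 ends up colder because hot coffee sheds heat fastest while the temperature gap to the room is largest; diluting first shrinks that gap for the whole two minutes.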

danbruc

No need for thinking, that question can be found discussed and explained many times online and has almost certainly been part of the training data.

akoboldfrying

The fact that the answer is interesting makes me suspect that it's not a good test for thinking. I remember reading the explanation for the answer somewhere on the internet years ago, and it's stayed with me ever since. It's interesting enough that it's probably been written about multiple times in multiple places. So I think it would probably stay with a transformer trained on large volumes of data from the internet too.

I think a better test of thinking is to provide detail about something so mundane and esoteric that no one would have ever thought to communicate it to other people for entertainment, and then ask it a question about that pile of boring details.

ur-whale

> I like to make up my own tests

You just ruined your own test by publishing it on the internets

pythonaut_16

I’m not sure how much this tells me about a model’s coding ability though.

It might correlate to design level thinking but it also might not.

vintermann

I have another easy one which thinking models get wrong:

"Anhentafel numbers start with you as 1. To find the Ahhentafel number of someone's father, double it. To find the Ahnentafel number of someone's mother, double it and add one.

Men pass on X chromosome DNA to their daughters, but none to their sons. Women pass on X chromosome DNA to both their sons and daughters.

List the Ahnentafel numbers of the closest 20 ancestors a man may have inherited X DNA from."

For smaller models, it's probably fair to change the question to something like: "Could you have inherited X chromosome DNA from your ancestor with Ahnentafel number 33? Does the answer to that question depend on whether you are a man or a woman?" They still fail.
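
For what it's worth, the puzzle is mechanical enough to check with a few lines of code, using only the rules as stated (father of n is 2n, mother is 2n+1; a man's X comes only from his mother, while a woman's X DNA comes from both parents). A minimal sketch:

    # Enumerate, in ascending Ahnentafel order, the ancestors a man may have
    # inherited X-chromosome DNA from.
    import heapq

    def x_dna_ancestors(count=20):
        heap = [(3, False)]  # proband is male, so start at his mother; False = female
        result = []
        while len(result) < count:
            n, is_male = heapq.heappop(heap)
            result.append(n)
            if is_male:
                heapq.heappush(heap, (2 * n + 1, False))  # a man's X: mother only
            else:
                heapq.heappush(heap, (2 * n, True))       # a woman's X: her father...
                heapq.heappush(heap, (2 * n + 1, False))  # ...and her mother
        return result

    print(x_dna_ancestors())
    # [3, 6, 7, 13, 14, 15, 26, 27, 29, 30, 31, 53, 54, 55, 58, 59, 61, 62, 63, 106]

A neat side effect: the number of possible X ancestors per generation follows the Fibonacci sequence (1, 2, 3, 5, 8, ...).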

freehorse

I asked QwQ this, and it started writing equations (Newton's law of cooling) and arrived at T_2 < T_1, then said this was counterintuitive, wrote more equations and arrived at the same result, and then wrote an explanation of why this is indeed the case rather than what intuition suggests, concluding with the right answer.

It is the only model I gave this to that actually approached it by writing math. Usually I am not that impressed with reasoning models, but this was quite fun to watch.

mrcwinn

Obviously you would prepare cold brew the night before.

atlex2

Yes, absolutely this! We're working on these problems at FlyShirley for our pilot training tool. My go-to is: "I'm facing 160 degrees and want to face north. What's the quickest way to turn, and by how much?"

For small models and when attention is "taken up", these sorts of questions really throw a model for a loop. Agreed - especially noticeable with small reasoning models.
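
The arithmetic being probed is tiny, which is what makes it a good attention check. A sketch of the intended calculation, assuming compass headings in degrees with 0 = north and clockwise positive:

    def quickest_turn(heading, target=0.0):
        # Clockwise difference from current heading to target, in [0, 360).
        diff = (target - heading) % 360
        return ("right", diff) if diff <= 180 else ("left", 360 - diff)

    print(quickest_turn(160))  # ('left', 160.0): turning left 160 beats turning right 200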

chvid

They leak the second they are used on a model behind an API, don't they?

chvid

As far as I can tell, the only way of comparing two models that cannot be easily gamed is to have them both in open-weights form and then run them against a benchmark created after both models were released.

anotherpaulg

Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5

mikae1

It's clear that progress is incremental at this point. At the same time Anthropic and OpenAI are bleeding money.

It's unclear to me how they'll shift to making money while providing almost no enhanced value.

khafra

Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are enough fundamental economic changes to provide us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.

vessenes

Paul, I saw in the notes that using claude with thinking mode requires yml config updates -- any pointers here? I was parsing some commits, and I couldn't tell if you only added architect support through openrouter. Thanks!

anotherpaulg

Here are the current docs for changing the thinking token limits.

https://aider.chat/docs/llms/anthropic.html#thinking-tokens

I'll make this less clunky soon.

pclmulqdq

Also for $36.83 compared to o1's $186.50

pzo

But also: $36.83, compared to $13.29 for DeepSeek R1 + claude-3-5; and for the latter, "Percent using correct edit format" is 100% vs 97.8% for 3.7.

edit: would be interesting to see how combo DeepSeek R1 + claude-3-7 performs.

VectorLock

How does it stack up against Grok3? I've seen some discussion that Grok3 is good for coding.

viraptor

It isn't available over API yet, as far as I know, so it can't really be tested independently.

pclmulqdq

Pro tip: It's hard to trust Twitter for opinions on Grok. The thumb is very clearly on the scale. I have personally seen very few positive opinions of Grok outside of Twitter.

gwd

Interesting that the "correct diff format" score went from 99.6% with Claude 3.5 to 93.3% for Claude 3.7. My experience with using claude-code was that it consistently required several tries to get the right diff. Hopefully all that will improve as they get things ironed out.

macNchz

Reasoning models pretty reliably seem to do worse at exacting output formats/structured outputs—so far with Aider it has been an effective strategy to employ o1 to “think” about the issue at hand, and have Sonnet implement. Interested to try various approaches with 3.7 in various combinations of reasoning effort.

bugglebeetle

It's funny because I have also found myself doing exactly this with R1 + Sonnet 3.5 recently. Windsurf allows you to do a chat-mode exchange with one model and then switch to another to implement. The reasoning models all seem pretty poorly suited to agentic workflows, but work well when paired with Claude.

WatchDog

3.7 completed a lot more than 3.5. Without seeing the actual results, we can't tell whether there were any regressions in the edit format among the previously completed tasks.

Sterling9x

That's a file context problem, because you use Cursor or Cline or some other crap context maker. Try Clood.

Unless Anthropic is under high usage (I just watch the incident reports), I one-shot features regularly.

At a high skill level. Not front end. Back end C# in a small but great framework that has poor documentation. Not just endpoints but full-on task queues.

So really, it's a context problem. You're just not laser-focusing your context.

Try this:

Set up a context with the exact files needed. Sure, AI "should" do that, but it doesn't. Especially not Cursor or Cline. Then try.

Hell, try it with Clood after I update it for 3.7. I bet you, if you Clood-file it, you get one-shots.

I have a long history of Clood being a commit in my projects, and it's a Clood one-shot.

nuancebydefault

Ah, the issue is contextual flux in your Clood-Cline stack. Just quantum defrag the file vectors, reverse-polarize the delta stream, and inject a neural bypass. If that fails, reboot the universe. One-shot cloodfile guaranteed.

/i

DonHopkins

Have you tried running a level 1 diagnostic on the subspace bypass?

rudedogg

Wtf is “clood”?

billmalarky

Hi Paul, been following the aider project for about a year now to develop an understanding of how to build SWE agents.

I was at the AI Engineering Summit in NYC last week and met an (extremely senior) staff AI engineer doing somewhat unbelievable things with aider. Shocking things, tbh.

Is there a good way to share stories about real-world aider projects like this with you directly (if I can get approval from him)? Not sure posting on public forum is appropriate but I think you would be really interested to hear how people are using this tool at the edge.

tecleandor

Hope it gets to be public; I love learning "weird" (or unusual) ways of using tools.

bearjaws

Thanks for all the work on aider, my favorite AI tool.

bt1a

It really is best in slot. A lot of that is owed to git, which has a particular synergy with a hallucination-prone but correctable system.

doctoboggan

I like Aider but I've turned off auto-commit. I just can't seem to let the AI actually commit code for me. Do you regularly let Aider commit for you? How much do you review the code written by it?

doctoboggan

Have you tried Claude 3.7 + Deepseek as the architect? Seeing as "DeepSeek R1 + claude-3-5-sonnet-20241022" is the second place option, "DeepSeek R1 + claude-3-7" would hopefully be the highest ranking choice so far?

SparkyMcUnicorn

It looks like Sonnet 3.7 (extended thinking) would be a better architect than R1.

I'll be trying out Sonnet 3.7 extended thinking + Sonnet 3.5 or Flash 2.0, which I assume would be at the top of the leaderboard.

attentive

Given that 3.5 and 3.7 cost the same, it doesn't make sense to use 3.5 here.

I'd like to see that benchmark, but R1 + 3.7 should be cheaper than 3.7T + 3.7

sheepdestroyer

Nice!

Could we please get benchmarks for architect mode with DeepSeek R1 + claude-3-7-20250219, to compare perf and price with Sonnet 3.7 thinking?

stavros

I'd like to second the thanks for Aider, I use it all the time.

bcherny

Hi everyone! Boris from the Claude Code team here. @eschluntz, @catherinewu, @wolffiex, @bdr and I will be around for the next hour or so and we'll do our best to answer your questions about the product.

babyshake

One thing I would love to have fixed - I type in a prompt, the model produces 90% or even 100% of the answer, and then shows an error that the system is at capacity and can't produce an answer. And then the response that has already been provided is removed! Please just make it where I can still have access to the response that has been provided, even if it is incomplete.

rishikeshs

This. Claude team, please fix this!

cat-snatcher

The UX team would never allow it. You gotta stay minimal, and definitely can't have any acknowledgement that a non-ideal user experience exists.

throwaway454647

I'll be publishing a Firefox extension as a temporary fix, will post it here. (I don't use Chrome.)

ZeroTalent

I think tampermonkey code is a better solution?

throwaway454647

I've made the extension, but I haven't been able to test it (hence I'd rather not release it). I use Claude daily, but I haven't bumped into the situation yet where the generated output would disappear.

Imustaskforhelp

Yup. It's a frustrating issue that messes with you - c'mon, you were there at the last line!

srigi

To me it doesn't look like a bug. I believe it is an intended "feature" pushed from high up in management - a dark pattern to make plebs pay for an answer that has overflowed the quota.

allpratik

Plus one for this.

pookieinc

The biggest complaint I (and several others) have is that we continuously hit the limit via the UI after even just a few intensive queries. Of course, we can use the console API, but then we lose ability to have things like Projects, etc.

Do you foresee these limitations increasing anytime soon?

Quick Edit: Just wanted to also say thank you for all your hard work, Claude has been phenomenal.

eschluntz

We are definitely aware of this (and working on it for the web UI), and that's why Claude Code goes directly through the API!

smallerfish

I'm sure many of us would gladly pay more to get 3-5x the limit.

And I'm also sure that you're working on it, but some kind of auto-summarization of facts to reduce the context in order to avoid penalizing long threads would be sweet.

I don't know if your internal users are dogfooding the product that has user limits, so you may not have had this feedback - it makes me irritable/stressed to know that I'm running up close to the limit without having gotten to the bottom of a bug. I don't think stress response in your users is a desirable thing :).

raylad

The problem with the API is that it, as it says in the documentation, could cost $100/hr.

I would pay $50/mo or something to have reasonable use of Claude Code in a limited way (though not as limited as through the web UI), but all of these coding tools seem to work only with the API and are therefore either too expensive or too limited.

sealthedeal

I haven't been able to find Claude CLI for public access yet. Would love to use it.

mianos

I paid for it for a while, but I kept running out of usage limits right in the middle of work every day. I'd end up pasting the context into ChatGPT to continue. It was so frustrating, especially because I really liked it and used it a lot.

It became such an anti-pattern that I stopped paying. Now, when people ask me which one to use, I always say I like Claude more than others, but I don’t recommend using it in a professional setting.

zaptrem

I have substantial usage via their API using LibreChat and have never run into rate limits. Why not just use that?

divan

Same.

punkpeye

If you are open to alternatives, try https://glama.ai/gateway

We currently serve ~10bn tokens per day (across all models). OpenAI-compatible API. No rate limits. Built-in logging and tracing.

I work with LLMs every day, so I am always on top of adding models. 3.7 is also already available.

https://glama.ai/models/claude-3-7-sonnet-20250219

The gateway is integrated directly into our chat (https://glama.ai/chat). So you can use most of the things that you are used to having with Claude. And if anything is missing, just let me know and I will prioritize it. If you check our Discord, I have a decent track record of being receptive to feedback and quickly turning around features.

Long term, Glama's focus is predominantly on MCPs, but chat, gateway, and LLM routing are integral to the greater vision.

I would love feedback if you're going to give it a try: frank@glama.ai

airstrike

The issue isn't API limits, but web UI limits. We can always get around the web interface's limits by using the claude API directly but then you need to have some other interface...

thrdbndndn

Just tried it. Is there a reason why the web UI is so slow?

Try deleting (closing) the panel on the right in a side-by-side view. It took a good second to actually close. Creating one isn't much faster.

This is unbearably slow, to be blunt.

tesch1

Who is glama.ai, though? I could not find company info on the site, and the Frank name writing the blog posts seems to be an alias for Popeye the sailor. Am I missing something there? How can a user vet the company?

cmdtab

Do you have DeepSeek R1 support? I need it for a current product I'm working on.

Daniel_Van_Zant

I see Cohere, is there any support for in-line citations like you can get with their first party API?

clangfan

This is also my problem. I've only used the UI with the $20 subscription; can I use the same subscription to use the CLI? I'm afraid it's like AWS API billing, where there is no limit to how much I can use and then I get a surprise bill.

eschluntz

It is API billing like AWS - you pay for what you use. Every time you exit a session we print the cost, and in the middle of a session you can do /cost to see your cost so far that session!

You can track costs in a few ways and set spend limits to avoid surprises: https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...

edmundsauto

I use AnythingLLM, so you can still have a Projects-like RAG.

posix86

Claude is my go-to LLM for everything. It sounds corny, but it's literally expanding the circle of what I can reasonably learn, manyfold. Right now I'm attempting to read old philosophical texts (without any background in similar disciplines), and without Claude's help explaining the dense language in simpler terms, discussing its ideas, giving me historical context, explaining why it was written this or that way, and comparing it against newer ideas - I would've given up many times.

At work I use it many times daily in development. Its concise mode is a breath of fresh air compared to any other LLM I've tried. It has helped me find bugs in unfamiliar code bases, explained tech stacks to me, and written bash scripts, saving me dozens of hours of work & many nerves. It generally gets me to places I wouldn't reach otherwise due to time constraints & nerves.

The only nitpick is that the service reliability is a bit worse than others', sometimes forcing me to switch. This is probably a hard question to answer, but are there plans to improve that?

davely

I'm in the middle of a particularly nasty refactor of some legacy React component code (hasn't been touched in 6 years, old class based pattern, tons of methods, why, oh, why did we do XYZ) at work and have been using Aider for the last few days and have been hitting a wall. I've been digging through Aider's source code on Github to pull out prompts and try to write my own little helper script.

So, perfect timing on this release for me! I decided to install Claude Code and it is making short work of this. I love the interface. I love the personality ("Ruminating", "Schlepping", etc).

Just an all around fantastic job!

(This makes me especially bummed that I really messed up my OA a while back for you guys. I'll try again in a few months!)

Keep on doing great work. Thank you!

bcherny

Hey thanks so much! <3

gwd

Just started playing with the command-line tool. First reaction (after using it for 5 minutes): I've been using `aider` as a daily driver, with Claude 3.5, for a while now. One of the things I appreciate about aider is that it tells you how much each query cost, and what your total cost is this session. This makes it low-key easy to keep tabs on the cost of what I'm doing. Any chance you could add that to claude-code?

I'd also love to have it in a language that can be compiled, like golang or rust, but I recognize a rewrite might be more effort than it's worth. (Although maybe less with claude code to help you?)

EDIT: OK, 10 minutes in, and it seems to have major issues doing basic patches to my Golang code; the most recent thing it did was add a line with incorrect indentation, then try three times to update it with the correct indentation, getting "String to replace not found in file" each time. Aider with Claude 3.5 does this really well - not sure what the confounding issue is here, but it might be worth taking a look at their prompt & patch format to see how they do it.

davidbarker

If you do `/cost` it will tell you how much you've spent during that session so far.

eschluntz

hi! You can do /cost at any time to see what the current session has cost

antirez

One of the silver bullets of Claude, in the context of coding, is that it does NOT use RAG when you use it via the web interface. Sure, you burn your tokens, but the model sees everything, and this lets it reply in a much better way. Is Claude Code doing the same, just with document-level RAG, so that if a document is relevant and it fits, the whole document is put inside the context window? I really hope so! Also, this means that splitting large code bases into manageable file sizes will make more and more sense. Another Q: is the context size of Sonnet 3.7 the same as 3.5's? Btw, thank you so much for Claude Sonnet; in recent months it changed the way I work and I'm able to do a lot more.

bcherny

Right -- Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for.
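
Roughly: instead of embedding the repo and retrieving chunks by similarity, the model iteratively calls search tools and reads only the files it decides are relevant. A minimal sketch of that idea - an illustration of the general technique, not Anthropic's implementation, and the tool names here are made up:

    # Two tools an agent might be given in place of a RAG index.
    import subprocess

    def grep_repo(pattern, path="."):
        proc = subprocess.run(["grep", "-rln", pattern, path],
                              capture_output=True, text=True)
        return proc.stdout  # newline-separated list of matching files

    def read_file(filepath, max_chars=4000):
        with open(filepath) as f:
            return f.read()[:max_chars]  # truncated to fit the context window

    # Schematic loop: the model picks the next call based on what it has seen.
    #   1. grep_repo("handleSubmit")  -> finds src/form.tsx
    #   2. read_file("src/form.tsx")  -> reads the implementation
    #   3. answer from what was read; no embedding index involved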

marlott

Interesting - can you elaborate a little on what you mean by agentic search here?

danso

Been a long time casual — i.e. happy to fix my code by asking questions and copy/pasting individual snippets via the chat interface. Decided to give the `claude` terminal tool a run and have to admit it looks like a fantastic tool.

Haven't tried to build a modern JS web app in years — it took the claude tool just a few minutes of prompting to convert and refactor an old clunky tool into a proper project structure using Svelte, Vite, and Tailwind (none of which I've built with before). Trying to learn how to even scaffold a modern app has felt daunting, and this eliminates 99% of that friction.

One funny quirk: I asked it to build a test suite (I know zilch about JS testing frameworks, so it picked vitest for me) for the newly refactored app. I noticed that 3 of the 20 tests failed and so I asked it to run vitest for itself and fix the failing things. 2 minutes later, and now 7 tests were failing...

Which is very funny to me, but also not a big deal. Again, it's such a chore to research test libs and then set things up to their conventions. That the claude tool built a very usable scaffold that I can then edit and iterate on is such a huge benefit by itself, I don't need (nor desire) the AI to be complete turnkey solution.

fsndz

Anthropic is back and cementing its place as the creator of the best coding models—bravo!

With Claude Code, the goal is clearly to take a slice of Cursor and its competitors' market share. I expected this to happen eventually.

The app layer has barely any moat, so any successful app with the potential to generate significant revenue will eventually be absorbed by foundation model companies in their quest for growth and profits.

keithwhor

I think an argument could be reasonably made that the app layer is the only moat. It’s more likely Anthropic eventually has to acquire Cursor to cement a position here than they out-compete it. Where, why, what brand and what product customers swipe their credit cards for matters — a lot.

fsndz

if Claude Code offers a better experience, users will rapidly move from cursor to Claude Code.

Claude is for Code: https://medium.com/thoughts-on-machine-learning/claude-is-fo...

neal_

Cursor has no models; they don't even have an editor, it's just VS Code.

biker142541

I wonder if they will offer competitive request counts against Cursor. Right now, at least for me, the biggest downside to Claude is how fast I blow through the limits (Pro) and hit a wall.

At least with Cursor, I can use all "premium" 500 completions and either buy more, or be patient for throttled responses.

biker142541

Reread the blog post, and I suspect Cursor will remain much more competitive on pricing! No specifics, but likely far exceeding typical Cursor costs for a typical developer. Maybe it's worth it, though? Look forward to trying.

>Claude Code consumes tokens for each interaction. Typical usage costs range from $5-10 per developer per day, but can exceed $100 per hour during intensive use.

eschluntz

hi! I've been using Claude Code in a very complementary way to my IDE, and one of the reasons we chose the terminal is because you can open it up inside whichever IDE you want!

freediver

Kagi LLM benchmark updated with general purpose and thinking mode for Sonnet 3.7.

https://help.kagi.com/kagi/ai/llm-benchmark.html

Appears to be second most capable general purpose LLM we tried (second to gemini 2.0 pro, in front of gpt-4o). Less impressive in thinking mode, about at the same level as o1-mini and o3-mini (with 8192 token thinking budget).

Overall a very nice update, you get higher quality and higher speed model at same price.

Hope to enable it in Kagi Assistant within 24h!

jjice

Thank you to the Kagi team for such fast turn around on new LLMs being accessible via the Assistant! The value of Kagi Assistant has been a no-brainer for me.

hackernewds

[flagged]

jjice

I find that giving encouraging messages when you're grateful is a good thing for everyone involved. I want the devs to know that their work is appreciated.

Not everything is a tactical operation to get more subscription purchases - sometimes people like the things they use and want to say thanks and let others know.

zaphod420

Some of us just actually really like Kagi...

Squarex

I'm surprised that Gemini 2.0 is first now. I remember that Google models were underperforming on Kagi benchmarks.

Workaccount2

Having your own hardware to run LLMs will pay dividends. Despite getting off on the wrong foot, I still believe Google is best positioned to run away with the AI lead, solely because they are not beholden to Nvidia and not stuck with a 3rd party cloud provider. They are the only AI team that is top to bottom in-house.

Squarex

I've used Gemini for its large context window before. It's a great model. But specifically in this benchmark it has always scored very low. So I wonder what has changed.

abixb

We should still wait around to see if Huawei is able to perfect its Ascend series for training and inferencing SOTA models.

ripped_britches

This is a great take

manmal

Gemini 2 is really good, and insanely fast.

irjustin

It's also insanely cheap.

Squarex

It is, but in this benchmark gemini scored very poorly in the past.

guelo

How did you choose the 8192-token thinking budget? I've often seen DeepSeek R1 use way more than that.

freediver

Arbitrary, and even with this budget it is already more verbose (and slower) overall than all the other thinking models - check tokens and latency in the table.

baobabKoodaa

I see it in Kagi Assistant already and it's not even 24 hours! Nice.

KTibow

One thing I don't understand is why Claude 3.5 Haiku, a non thinking model in the non thinking section, says it has a 8192 thinking budget.

flixing

Do you think Kagi is the right eval tool? If so, why?

freediver

The right eval tool depends on your evaluation task. Kagi LLM benchmark focuses on using LLMS in the context of information retrieval (which is what Kagi does) which includes measuring reasoning and instruction following capabilities.

thefourthchime

Nice, but where is Grok?

pertymcpert

Perhaps they're waiting for the Grok API to be public?

Kye

I thought o3-mini was o1-mini. OpenAI's naming gets confusing.

hubraumhugo

You can get your HN profile analyzed by it and it's pretty funny :)

https://hn-wrapped.kadoa.com/

I'm using this to test the humor of new models.

rakejake

*** Roast ***

* You've spent more time talking about your Carnatic raga detector than actually building it – at this rate, LLMs will be composing ragas before your detector can identify them.

* You bought a 7950X processor but can't figure out what to do with it – the computing equivalent of buying a Ferrari to drive to the grocery store once a week.

* You're so concerned about work-life balance that you took a sabbatical to think about your career, only to spend it commenting on HN about other people's careers.

*** End ***

I'll be in my room crying, in case anyone's looking for me.

atonse

My roast:

You've spent so much time discussing Apple vs Microsoft that Tim Cook and Satya Nadella probably have a joint restraining order against you.

Your comments about HTTPS everywhere suggest you're the kind of person who wears a tinfoil hat... but only after thoroughly researching the optimal thickness for blocking government signals.

You seem to have strong opinions about Flash - we get it, you're old enough to remember when websites had intro animations and your laptop could double as a space heater.

———

Totally forgot about the flash debates of the early 2010s!

maccard

That sabbatical one is savage.

rakejake

Funnily enough, I'm putting the 7950X to some use in the Carnatic Raga detector project since a lot of audio operations are heavy on CPU. But that last one nearly killed me. I'll have to go to Gemini or GPT for some therapy after that one.

desperatecuban

> Your salary is so low even your legacy code feels sorry for you.

> You're the only person on HN who thinks $800/month is a salary and not a cloud computing bill.

ouch

henry2023

On the bright side. Not many here could 10x their salary in a couple of years.

desperatecuban

I guess :)

IAmGraydon

>You're the only person on HN who thinks $800/month is a salary and not a cloud computing bill.

Now that is funny!

LinXitoW

Got absolutely read to filth:

> You've spent more time explaining why Go's error handling is bad than Go developers have spent actually handling errors.

> Your relationship with programming languages is like a dating show - you keep finding flaws in all of them but can't commit to just one.

> If error handling were a religion, you'd be its most zealous missionary, converting the unchecked one exception at a time.

airstrike

> You've spent more time explaining why Go's error handling is bad than Go developers have spent actually handling errors.

That is absolutely hilarious. Really well done by everyone who made that line possible.

sa46

Yea, these are nicely done. To add some balance:

> After years of defending Go, you'll secretly start a side project in Rust but tell no one on HN about your betrayal

milesrout

I got "You've spent more time explaining why Rust isn't memory-safe than most people have spent writing actual Rust code." So I suspect these are not as free-form-generated as they actually look?

jedberg

> For someone who worked at Reddit, you sure spend a lot of time on HN. It's like leaving Facebook to spend all day on Twitter complaining about social media.

Wow, so spot on it hurts!

calvinmorrison

>Your ideal tech stack is so old it qualifies for social security benefits

>You're the only person who gets excited when someone mentions Trinity Desktop Environment in 2025

> You probably have more opinions about PHP's empty() function than most people have about their entire career choices

drivers99

> Personal Projects: You'll finally complete that bare-metal Forth interpreter for Raspberry Pi

I was just looking into that again as of yesterday (I didn't post about it here yesterday, just to be clear; it picked up on that from some old comments I must have posted).

> Profile summary: [...] You're the person who not only remembers what a CGA adapter is but probably still has one in working condition in your basement, right next to your collection of programming books from 1985.

Exactly the case, in a working IBM PC, except I don't have a basement. :)

sitkack

> For someone who criticizes corporate structures so much, you've spent an impressive amount of time analyzing their technical decisions. It's like watching someone critique a restaurant's menu while eating there five times a week.

redeux

> You complain about digital distractions while writing novels in HN comment threads. That's like criticizing fast food while waiting in the drive-thru line.

>You'll write a thoughtful essay about 'digital minimalism' that reaches the HN front page, ironically causing you to spend more time on HN responding to comments than you have all year.

It sees me! Noooooo ...

throwup238

  Your comments about suburban missile defense systems have the FBI agent monitoring your internet connection seriously questioning their career choices.
  You've spent so much time explaining why manufacturing is complex that you could have just built your own CRT factory by now.
  You claim to be skeptical of AI hype, yet you've indexed more documentation with Cursor than most people have read in their lifetime.
Surprisingly accurate, but seems to be based on a very small snippet of actual comments (presumably to save money). I wonder what the prompt would output when given the full 200k tokens of context.

IAmGraydon

Yeah it seems like it doesn't go back very far, which is understandable.

tilsammans

My roasts are savage:

> Your 236-line 'simplified' code example suggests you might need to look up the definition of 'simplified' in a dictionary that's not written in Ruby.

OUCH

> You've spent so much time worrying about Facebook tracking you that you've failed to notice your dental nanobot fantasies are far more concerning to the rest of us.

Heard.

steve_adams_86

> You left a high-paying tech job to grow plants in water, which is basically just being a farmer with extra steps and less sunlight.

Ha

Also:

> Your comments read like someone who discovered philosophy in their 30s and now can't decide if they want to code or become the next Marcus Aurelius.

skull emoji

Uninen

I'm somewhat impressed by the very first interaction I had with Claude 3.7 Sonnet. I prompted it to find a problem in my codebase where a CloudFlare Pages function would return a 500 plus a nonsensical error and an empty response in prod. I'd tried to figure this out all Friday. It was super annoying to fix, as there was no way to add more logging or get any visibility into the issue: the script died before outputting anything.

Both o1, o3 and Claude 3.5 failed to help me in any way with this, but Claude 3.7 not only found the correct issue with its first answer (after thinking for 39 seconds) but then continued to write me a working function to work around the issue with the second prompt. (I'm going to let it write some tests later but stopped here for now.)

I assume it doesn't let me share the discussion as I connected my GitHub repo to the conversation (a new feature in the web chat UI launched today), but I copied it as a gist here: https://gist.github.com/Uninen/46df44f4307d324682dabb7aa6e10...

Uninen

One thing about the reply gives away why Claude is still basically clueless about Actual Thinking: it suggested I move the HTML sanitization to the frontend. It's in the CF function because it would be trivial to bypass in the frontend, making it easy to post literally anything into the db. Even a junior developer would understand this.

gen3

You could move the sanitization to the front end securely; it would just need to happen right before render (after fetching the data to the browser). Some UI libraries do this automatically (like React), and DOMPurify can run in the browser for this task.

It could have done a better job outlining how to do it properly.

smt88

GP was talking about input sanitization, not output

null

[deleted]

simonw

I got this working with my LLM tool (new plugin version: llm-anthropic 0.14) and figured out a bunch of things about the model in the process. My detailed notes are here: https://simonwillison.net/2025/Feb/25/llm-anthropic-014/

One of the most exciting new capabilities is that this model has a 120,000 token output limit - up from just 8,000 for the previous Claude 3.5 Sonnet model and way higher than any other model in the space.

It seems to be able to use that output limit effectively. Here's my longest result so far, though it did take 27 minutes to finish! https://gist.github.com/simonw/854474b050b630144beebf06ec4a2...
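
For anyone who wants to poke at the long-output mode directly, here's a minimal sketch using the Anthropic Python SDK; streaming is the sane way to consume a response this large. The model ID matches the one mentioned elsewhere in this thread, but whether the extended output limit requires a beta opt-in may change, so treat this as a starting point rather than gospel:

    # Stream a very long completion; max_tokens is far above the old 8k ceiling.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with client.messages.stream(
        model="claude-3-7-sonnet-20250219",
        max_tokens=120_000,
        messages=[{"role": "user", "content": "Write an exhaustive guide to ..."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)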

tedsanders

No shade against Sonnet 3.7, but I don't think it's accurate to say way higher than any other model in the space. o1 and o3-mini go up to 100,000 output tokens.

https://platform.openai.com/docs/models#o1

simonw

Huh, good call thanks - I've updated my post with a correction.

dot1x

Simon, do you write anywhere about how you manage to be so... active? Between your programming tools, blogging, and your job (I assume you work?), where do you find the time/energy?

simonw

The trick is not to have an employer: I'm "freelance" aka working full time on my open source projects and burning down my personal runway from a startup acquisition. At some point I need to start making proper money again.

Citizen_Lame

How much did it cost?

mrbonner

$1.80

rvnx

I have a very long request to do like this; did you use a specific CLI tool? (Thank you in advance)

t55

Anthropic doubling down on code makes sense, that has been their strong suit compared to all other models

Curious how their Devin competitor will pan out given Devin's challenges

ru552

Considering that their models power a majority of Cursor/Windsurf usage, plus their play with MCP, I think they just have to figure out the UX and they'll be fine.

weinzierl

It's their strong suit no doubt, but sometimes I wish the chat would not be so eager to code.

It often throws code at me when I just want a conceptual or high level answer. So often that I routinely tell it not to.

ben30

I’ve set up a custom style in Claude that won’t code but just keeps asking questions to remove assumptions:

Deep Understanding Mode (根回し - Nemawashi Phase)

Purpose:
- Create space (間, ma) for understanding to emerge
- Lay careful groundwork for all that follows
- Achieve complete understanding (grokking) of the true need
- Unpack complexity (desenrascar) without rushing to solutions

Expected Behaviors:
- Show determination (sisu) in questioning assumptions
- Practice careful attention to context (taarof)
- Hold space for ambiguity until clarity emerges
- Work to achieve intuitive grasp (aperçu) of core issues

Core Questions:
- What do we mean by [key terms]?
- What explicit and implicit needs exist?
- Who are the stakeholders?
- What defines success?
- What constraints exist?
- What cultural/contextual factors matter?

Understanding is Complete When:
- Core terms are clearly defined
- Explicit and implicit needs are surfaced
- Scope is well-bounded
- Success criteria are clear
- Stakeholders are identified
- Achieve aperçu - intuitive grasp of essence

Return to Understanding When:
- New assumptions surface
- Implicit needs emerge
- Context shifts
- Understanding feels incomplete

Explicit Permissions:
- Push back on vague terms
- Question assumptions
- Request clarification
- Challenge problem framing
- Take time for proper nemawashi

NitpickLawyer

> I just want a conceptual or high level answer

I've found claude to be very receptive to precise instructions. If I ask for "let's first discuss the architecture" it never produces code. Aider also has this feature with /architect

ap-hyperbole

I added a custom instruction under my Profile settings in the "personal preferences" text box, something along the lines of: "I like to discuss things before wanting the code. Only generate code when I prompt for it. Any question should be answered as a discussion first, and only when prompted should the implementation code be provided." It works well; occasionally I want to see the code straight away, but this does not happen often.

KerryJones

I complain about this all the time, despite me saying "ask me questions before you code" or all these other instructions to code less, it is SO eager to code. I am hoping their 3.7 reasoning follows these instructions better

vessenes

We should remember 3.5 was trained in an era when ChatGPT would routinely refuse to code at all and architected in an era when system prompts were not necessarily very effective. I bet this will improve, especially now that Claude has its own coding and arch cli tool.

cruffle_duffle

Even when you tell it “no code, just talk. Let’s ensure we are in alignment and discuss our options. I’ll tell you when to code” it still decides it is going to write code.

Telling it “if you were in an interview and you jumped to writing code without asking any questions, you’d fail the interview” is usually good enough to convince it to stop and ask questions.

perdomon

I get this as well, to the point where I created a specific project for brainstorming without code — asking for concepts, patterns, architectural ideas without any code samples. One issue I find is that sometimes I get better answers without using projects, but I'm not sure if that's everyone's experience.

bitbuilder

That's been my experience as well with projects, though I have yet to do any sort of A/B testing to see if it's all in my head or not.

I've attributed it to all your project content (custom instruction, plus documents) getting thrown into context before your prompt. And honestly, I have yet to work with any model where the quality of the answer wasn't inversely proportional to the length of context (beyond of course supplying good instruction and documentation where needed).

DirkH

Just explicitly tell it not to write code? I do that all the time when I do not want code and it's never an issue.

malux85

I thought the same thing, I have 3 really hard problems that Claude (or any model) hasn’t been able to solve so far and I’m really excited to try them today

mrklol

Did it work?

malux85

No :<

KaoruAoiShiho

They cited Cognition (Devin's maker) in this blog post which is kinda funny.

jumploops

> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.”

This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.

Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.

eschluntz

Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.

Getting things done requires a lot of book smarts, but also a lot of "street smarts" - knowing when to answer quickly, when to double back, etc.

jasonjmcghee

Just want to say nice job and keep it up. Thrilled to start playing with 3.7.

In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case - except massive text tasks, for which I use Gemini 2.0 Pro with the 2M token context window.

jasonjmcghee

An update: "code" is very good. Just did a ~4-hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.

martinald

I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!

LouisSayers

Could you tell us a bit about the coding tools you use and how you go about interacting with Claude?

catherinewu

We find that Claude is really good at test-driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests.
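
That loop is straightforward to automate around any code-writing model. A minimal sketch, assuming pytest as the test runner; apply_model_edits() is a hypothetical stand-in for whatever LLM call and file-editing mechanism you use:

    # Test-first iteration: write tests, then loop until the suite passes.
    import subprocess

    def apply_model_edits(prompt):
        """Hypothetical: send the prompt to a model and apply its edits to disk."""
        raise NotImplementedError("wire up your LLM tool of choice here")

    def run_tests():
        proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout

    def iterate_until_green(task, max_rounds=5):
        apply_model_edits(f"Write failing tests for: {task}. Then implement it.")
        for _ in range(max_rounds):
            passed, output = run_tests()
            if passed:
                return True
            apply_model_edits(f"The tests still fail:\n{output}\nFix the code.")
        return False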

crowcroft

Sometimes I wonder if there is overfitting towards benchmarks (DeepSeek is the worst for this to me).

Claude is pretty consistently the chat I go back to where the responses subjectively seem better to me, regardless of where the model actually lands in benchmarks.

ben_w

> Sometimes I wonder if there is overfitting towards benchmarks

There absolutely is, even when it isn't intended.

The difference between what the model is fitting to and reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.

(Ok, not every problem, there's also sample efficiency, and…)

FergusArgyll

Ya, Claude crushes the smell test

bicx

Claude 3.5 has been fantastic in Windsurf. However, it does cost credits. DeepSeek V3 is now available in Windsurf at zero credit cost, which was a major shift for the company. Great to have variable options either way.

I’d highly recommend anyone check out Windsurf’s Cascade feature for agentic-like code writing and exploration. It helped save me many hours in understanding new codebases and tracing data flows.

throwup238

DeepSeek’s models are vastly overhyped (FWIW I have access to them via Kagi, Windsurf, and Cursor - I regularly run the same tests on all three). I don’t think it matters that V3 is free when even R1 with its extra compute budget is inferior to Claude 3.5 by a large margin - at least in my experience in both bog standard React/Svelte frontend code and more complex C++/Qt components. After only half an hour of using Claude 3.7, I find the code output is superior and the thinking output is in a completely different universe (YMMV and caveat emptor).

For example, DeepSeek’s models almost always smash together C++ headers and code files even with Qt, which is an absolutely egregious error due to the meta-object compiler preprocessor step. The MOC has been around for at least 15 years and is all over the training data so there’s no excuse.

SkyPuncher

I've found DeepSeek's models are within a stone's throw of Claude. Given the massive price difference, I often use DeepSeek.

That being said, when cost isn't a factor Claude remains my winner for coding.

rubymamis

Hey there! I’m a fellow Qt developer and I really like your takes. Would you like to connect? My socials are on my profile.

bionhoward

The big difference is DeepSeek R1 has a permissive license whereas Claude has a nightmare “closed output” customer noncompete license which makes it unusable for work unless you accept not competing with your intelligence supplier, which sounds dumb

tonyhart7

I've seen people switch from Claude to another model, notably DeepSeek, due to cost. Tbh I think it still depends on the data the model was trained on.

ai-christianson

I'm working on an OSS agent called RA.Aid and 3.7 is anecdotally a huge improvement.

About to push a new release that makes it the default.

It costs money but if you're writing code to make money, it's totally worth it.

newgo

How is it possible that deepseek v3 would be free? It costs a lot of money to host models

null

[deleted]

TriangleEdge

This AI race is happening so fast. Seems like it to me anyway. As a software developer/engineer I am worried about my job prospects.. time will tell. I am wondering what will happen to the west coast housing bubbles once software engineers lose their high price tags. I guess the next wave of knowledge workers will move in and take their place?

fallinditch

My guess is that, yes, the software development job market is being massively disrupted, but there are things you can do to come out on top:

* Learn more of the entire stack, especially the backend, and devops.

* Embrace the increased productivity on offer to ship more products, solo projects, etc

* Be highly selective as far as possible in how you spend your productive time: being uber-effective can mean thinking and planning in longer timescales.

* Set up an awesome personal knowledge management system and agentic assistants

shinycode

We have thousands of old systems to maintain. I'm not sure everything could be rewritten or maintained with only an LLM. If an LLM can build a whole system on its own and is able to maintain and fix it, then it's not just us software developers who will suffer; sales and marketing mean nothing either, because people will just ask an LLM to do something. I'm not sure this is possible.

ChatGPT gave me a list of commands for my EC2 instance, and one of them, when executed, made me lose access to SSH. It didn't warn me. So "blindly" following an LLM's lead through a cascade of instructions, at massive scale and over a long period, could also lead to massive bugs or data corruption. Who hasn't asked an LLM for some code that contained mistakes we then had to point out to it? I doubt systems will stay robust with full autonomy and no human supervision. But it's a great tool to iterate with, throwing away code after testing ideas.

whynotminot

> Learn more of the entire stack, especially the backend, and devops.

I actually wonder about this. Is it better to gain some relatively mediocre experience at lots of things? AI seems to be pretty good at lots of things.

Or would it be better to develop deep expertise in a few things? Areas where even smart AI with reasoning still can get tripped up.

Trying to broaden your base of expertise seems like it’s always a good idea, but when AI can slurp the whole internet in a single gulp, maybe it isn’t the best allocation of your limited human training cycles.

aizk

I was advised to be T shaped, wide reach + one narrow domain you can really nail.

null

[deleted]

j_maffe

Do you have any specific tips for the last point? I completely agree with it and have set up a fairly robust Obsidian note taking structure that will benefit greatly from an agentic assistant. Do you use specific tools or workframe for this?

fallinditch

What works well for me at the moment is to write 'books' - i.e. use AI as a writing assistant for large documents. I do this because the act of compiling the info with AI assistance helps me to assimilate the knowledge. I use a combination of ChatGPT, Perplexity, and Gemini with NotebookLM - to merge responses from separate LLMs, provide critical feedback on a response or a chunk of writing, etc.

This is a really accessible setup and is great for my current needs. Taking it to the next stage with agentic assistants is something I'm only just starting out on. I'm looking at WilmerAI [1] for routing ai workflows and Hoarder [2] to automatically ingest and categorize bookmarks, docs and RSS feed content into a local RAG.

[1] https://github.com/SomeOddCodeGuy/WilmerAI

[2] https://hoarder.app/

jmehman

You know about the copilot plugin for obsidian?

ijidak

I love it, especially the last point.

But, what do you use for agentic assistants?

fallinditch

See answer above, it's something I want to get into. I am inspired by this post on Reddit, it's very cool what this guy is doing.

https://www.reddit.com/r/LocalLLaMA/comments/1i1kz1c/sharing...

bilbo0s

This is really good advice.

Underrated comment.

viraptor

It seems to be slowing down actually. Last year was wild until around llama 3. The latest improvements are relatively small. Even the reasoning models are a small improvement over explicit planning with agents that we could already do before - it's just nicely wrapped and slightly tuned for that purpose. Deepseek did some serious efficiency improvements, but not so much user-visible things.

So I'd say that the AI race is starting to plateau a bit recently.

j_maffe

While I agree, you have to remember how high the dimensionality of the labor-skill space is. The way I see it, you can imagine the capability of AI as a radius, and the set of tasks it can cover as a sphere. Linear improvements in performance cause cubic (or whatever the labor-skill dimensionality is) improvements in task coverage.

manmal

I‘m not sure that’s true with the latest models. o3-mini is good at analytical tasks and coding, and it really sucks at prose. Sonnet 3.7 is good at thinking but lost some ability in creating diffs.

throw234234234

It has the potential to affect a lot more than just SV/the West Coast - in fact SV may be one of the only areas with some silver lining in AI development. I think these models have a chance to disrupt employment in the industry globally. Ironically it may be only SWEs and a few other industries (writing, graphic design, etc) that truly change. You can see they and other AI labs are targeting SWEs in particular - just look at the announcement "Claude 3.7 and Code" - very little mention of any other domains in their announcement posts.

For people who aren't in SV for whatever reason and haven't seen the really high pay associated with being there - SWE is just a standard job, often stressful, with lots of ongoing learning required. The pain/anxiety of being disrupted is even higher then, since having high disposable income to invest/save would have been less likely. Software to them would have been a job with pay comparable to other jobs in the area, often requiring you to be degree-qualified as well - anecdotally, many I know got into it for the love, not the money.

Who would have thought the first job automated by AI would be software itself? Not labor, or self-driving cars. Other industries either seem to have hit dead ends, or have other barriers (regulation, closed knowledge, etc) that make them harder to disrupt. SWEs have set an example for other industries - don't let AI in, or keep it in-house as long as possible. Be closed source, in other words. Seems ironic in hindsight.

throw83288

What do you even do then as a student? I've asked this dozens of times with zero practical answers at all. Frankly I've become entirely numb to it all.

throw234234234

Be glad that you are empowered to pivot - I'm assuming you are still young, being a student. In a disrupted industry you either want to be young (time to change out of it) or old (50+, able to retire with enough savings). The middle-aged people (say 15-25 years in the industry; your 35-50 year olds) are most in trouble, depending on the domain they are in. For all the "friendly" marketing, IMO they are targeting tech jobs in general - for many people, if it wasn't for tech/coding/etc, they would never need to use an LLM at all. Anthropic's recent stats as to who uses their products are telling - it's mostly code, code, code.

The real answer is either to pivot to a domain where the computer use/coding skills are secondary (i.e. you need the knowledge but it isn't primary to the role) or move to an industry which isn't very exposed to AI, either due to natural protections (e.g. trades) or artificial ones (e.g. regulation, or oligopolies colluding to prevent knowledge leaking to AI). May not be a popular comment on this platform - I would love to be wrong.

weatherlite

I'm sure lots of potential students/bootcampers are now not going into programming (or if they are, the smart ones try to go into niches like AI and skip web/backend/Android altogether). This will work against the number of jobs being reduced by AI. It will take a few years to play out, but at some point we will see fewer people trying to get into the field and applying for jobs, certainly for junior positions. We've already had ~2 bad years; a couple more like this will really dry up the flow of newcomers. Fewer people coming in (than otherwise would have) means that for every person who retires or leaves the industry there are fewer people to take their place. This situation is quite complex, with lots of parameters working in different directions, so it's very early to try to get some kind of read on where this is going.

As a new career, I'd probably not choose SWE now. But if you've already done 10 years, I'd ride it out; there is a good chance most of us will remain employed for many years to come.

LouisSayers

I'm not too concerned in the short to medium term. I feel there are just too many edge cases and nuances that are going to be missed by AI systems.

For example, systems don't always work in the way they're documented to. How is an AI going to differentiate cases where there's a bug in a service vs a bug in its own code? How will an AI even learn that the bug exists in the first place? How will an AI differentiate between someone reporting a bug and a hacker attempting to break into a system?

The world is a complex place and without ACTUAL artificial intelligence we're going to need people to at least guide AI in these tricky situations.

My advice would be to get familiar with AI and the new AI tools, and with how they fit into our usual workflows.

Others may disagree, but I don't think software engineers (at least the good ones) are going anywhere.

frabcus

I think if models improve (but we don't get a full singularity) then jobs will increase.

e.g. if software costs 5x less to make, demand will go up more than 5x, as supply is highly constrained now. Lots of companies want better software but it costs too much.

That will create more jobs.

There'll be more product management, human interaction, and edge-case testing, and less typing. Although I think there'll also be a bunch of very technical jobs debugging things when the models fail.

So my advice is learn skills that help make software useful to people and businesses - from user research to product management. As well as engineering.

aucisson_masque

The thing is that cost won't go down by just 5x, but by much more.

Once the AI gets smart enough that it only requires an intern to write the prompt and fix the few mistakes, development will cost next to nothing.

There is only so much demand for software development.

ttul

Trade your labour for capital. Own the means of production. This translates to: build a startup.

ilrwbwrkhv

[flagged]

GaggiX

>There is no intelligence here and Claude 3.7 cannot create anything novel.

I wouldn't be surprised if people continued to deny the actual intelligence of these models even in a scenario where they were able to solve the Riemann hypothesis.

"Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'" - cit

eschluntz

Even when I feel this, 90% of any novel thing I'm doing is still old gruntwork, and Claude lets me speed through that and focus all my attention on the interesting 10% (disclaimer: I'm at Anthropic)

TriangleEdge

Do you think the "deep research" feature that some AI companies have will ever apply to software? For example, I had to update Spring in a Java codebase recently. AI was only mildly helpful in figuring out why I was seeing some errors, but that's it.

trgaf

One can also steal directly from GitHub and strip the license to avoid this grunt work. LLMs automate the stealing.

vasco

How many novel things does a developer do at work as a percentage of their time?

riku_iki

That's because stacks/APIs/ecosystems are super complicated and require lots of reading/searching to figure out how to make things happen. Now that time will be reduced dramatically, and devs' time will shift to more novel things.

eterm

The threat is not autocomplete, it's translation.

"translating" requirements into code is what most developers' jobs are.

So "just" translation is a threat to job security of developers.


martin-t

Built on top of stolen code, no less. HN hates to hear it, but LLMs are a huge step back for software freedom, because as long as they call it "AI" and as long as politicians don't understand it, companies can launder GPL code and reuse it without credit and without giving users their rights.

Trasmatta

> Its not AI

AI is a very broad term with many different definitions.

dingnuts

It does seem disingenuous for something without intelligence to be called intelligence.

gchokov

This is BS and you are not listening and watching carefully.

lukaslalinsky

Even the best LLMs today are just junior devs with a lot of knowledge. They make a lot of the same mistakes junior devs would make. Even the responses, when you point out those mistakes, are the same.

If anything, it's a tool for junior devs to get better and spend more time on the architecture.

Using AI code without fully understanding it (i.e. operated by a non-programmer) is just a recipe for disaster.

dingnuts

OK then show me a model that can answer honestly and correctly about whether or not it knows something.

j_maffe

It redid half of my BSc thesis in less than 30s :|

https://claude.ai/share/ed8a0e55-633f-4056-ba70-772ab5f5a08b

edit: Here's the output figure https://i.imgur.com/0c65Xfk.png

edit 2: Gemini Flash 2 failed miserably https://g.co/gemini/share/10437164edd0

crm9125

Yes, usually most of the topics covered in undergraduate studies are well documented and understood, and therefore likely to be part of the AI's training data.

Once you get to graduate studies, the material coverage is a little more sparse/niche (though usually still not groundbreaking), and for a PhD, coverage is mostly non-existent, since the point is to expand upon current knowledge in the field and many topics are being explored for the first time.

ThouYS

Master's and PhD next!

akreal

Could this (or something similar) be found in public access/some libraries?

j_maffe

There is only a single paper that has published a similar derivation, and it contains a critical mistake. To be fair, there are many documented examples of how to derive parametric relationships in linkages, and the process can be quite methodical. I think I could get Gemini or 3.5 to do it, but not single-shot/ultra fast like here.

modeless

I updated Cursor to the latest 0.46.3 and manually added "claude-3.7-sonnet" to the model list and it appears to work already.

"claude-3.7-sonnet-thinking" works as well. Apparently controls for thinking time will come soon: https://x.com/sualehasif996/status/1894094715479548273

Hadriel

So do you think it's a better experience using 3.7 through Cursor, or through the 3.7 terminal experience?

d_watt

I'm about 50kloc into a project building a React Native app with a Go backend for recipes, with grocery lists, collaborative editing, and household sharing - so a complex data model and runtime. Purely as the experiment of "what's it like to build with AI, no lines of code directly written, just directing the AI."

As I go through features, I'm comparing a matrix of Cursor, Cline, and Roo, with the various models.

While I'm still working on the final product, there's no doubt in my mind that Sonnet is the only model that works with these tools well enough to be agentic (rather than doing single-file work).

I'm really excited to now compare this 3.7 release and how good it is at avoiding some of the traps 3.5 can fall into.

wokwokwok

"no lines of code directly written, just directing the AI"

/skeptical face.

Without fail, every. single. person. I've met who says that actually means "except for the code that I write", or "except for how I link the code it builds together by hand".

If you are 50kloc into a large complex project that you have literally written none of, and have, e.g., used Cursor to generate the code without any assistance... well, you should start a startup.

...because that's what Devin was supposed to be, and it was enormously and famously terrible at it.

So that would be either a) terribly exciting, or b) hyperbole.

M4v3R

I’m currently doing something very similar to what GP is doing - I’m building a hobby project that’s a desktop app with a web frontend. It’s a map editor with a 3D view. My estimate is that 80-90% of the code was written by AI. Sure, I did have to intervene or write some of the more complex parts myself, but it’s still exciting to me that in many cases it took just a single prompt to add a new feature or change existing behavior. Judging from the complexity of the project, it would have taken me 4-5x longer to write it completely by hand. It’s a game changer for me.

wokwokwok

> My estimate is that 80-90% of the code was written by AI

Nice! It is entirely reasonable both to do that and to be excited about it.

…buuut, if that’s what you’re doing, you should say so.

Not:

“no lines of code directly written, just directing the AI”

Because those (gluing together AI code by hand and having the agent do everything) are different things, and one of them is much much MUCH harder to get right than the other one.

That last 10-15%. Self-driving cars are the same story, right?

d_watt

That's the point of the experiment I'm doing: what it takes to get these things to generate all the code, with me just directing.

I literally have not written a line of code. The AI agent configures the build systems. It executes the `go install` command. It configures the infrastructure via Terraform.

It takes a lot of reading of the generated code to see what I agree with or not, and redirecting refactorings. Understanding how to describe problem statements that get translated into design docs that get translated into task lists. It's still a lot of knowledge work on how to build software. But coding from those plans, which might have taken a day, now takes me 20 minutes.

Regarding startups, there's nothing here I'm doing that isn't just learning the tools of agentic coding. The business here might be advising people on how to do it themselves.

fixprix

If you know how to architect code well, you can guide the AI to create smaller, more targeted modules. That way, as you 'write code with AI', you give it a targeted subset of the files to edit with each prompt.

In a way the AI becomes the dev and you become the code reviewer. Often as the AI is writing the code, you're thinking about the next step.
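
As a rough sketch of what "a targeted subset of the files" can look like when scripted by hand (the file names and prompt here are hypothetical, and this assumes the Anthropic Python SDK; tools like Cursor or aider automate the same loop):

    import anthropic
    from pathlib import Path

    # Hypothetical targeted subset: only the modules this change should touch.
    FILES = ["app/recipes.py", "app/grocery_list.py"]

    # Concatenate just those files into the prompt context.
    context = "\n\n".join(f"### {p}\n{Path(p).read_text()}" for p in FILES)

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=8192,
        messages=[{
            "role": "user",
            "content": context + "\n\nAdd a 'quantity' field to grocery "
                                 "list items. Return each updated file in full.",
        }],
    )
    print(response.content[0].text)

Keeping the file list small is the whole trick: the model has less to review, hallucinates less, and the diff stays reviewable.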

Maxion

It's not like you go to Claude and say "Grug now use AI, Grug say AI make app OR GRUG HIT AI WITH HAMMER!" and expect 50kloc of code to appear.

You do it one step at a time, similarly to how you would structure good tickets (often even smaller).

AI often still makes shit, but you do get somewhere a whole heap load of time faster.


thebigspacefuck

This has been my experience as well. Why do the others suck so bad?

d_watt

I wonder how much of it is self-fulfilling, with the developers of the agents tuning their prompts/tool calls to Sonnet.