I've been using Claude Code for a couple of days
506 comments
· March 9, 2025
develoopest
abxyz
I think it's probably the difference between "code" and "programming". An LLM can produce code, and if you're willing to surrender to the LLM's version of whatever it is you ask for, then you can have a great and productive time. If you're opinionated about programming, LLMs fall short. Most people (software engineers, developers, whatever) are not "programmers", they're "coders", which is why they have a positive impression of LLMs: they produce code, LLMs produce code... so LLMs can do a lot of their work for them.
Coders used to be more productive by using libraries (e.g., don't write your own function for finding the intersection of arrays, use intersection from Lodash), whereas now libraries have been replaced by LLMs. Programmers laughed at the absurdity of left-pad[1] ("why use a dependency for 16 lines of code?") whereas coders thought left-pad was great ("why write 16 lines of code myself?").
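For reference, the function at the center of that controversy was roughly this (a from-memory TypeScript sketch, not the verbatim package):

    // Pad `str` on the left with `ch` until it is at least `len` characters.
    function leftPad(str: string, len: number, ch: string = " "): string {
      let pad = "";
      while (pad.length + str.length < len) pad += ch;
      return pad + str;
    }

    leftPad("42", 5, "0"); // => "00042"

The whole debate was whether those few lines belong in your repo or in someone else's.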
If you think about code as a means to an end, and focus on the end, you'll get much closer to the magical experience you see spoken about on Twitter, because their acceptance criterion is "good enough", not "right". Of course, if you're a programmer who cares about the artistry of programming, that feels like a betrayal.
miki123211
Oh, this captures my experience perfectly.
I've been using Claude Code a lot recently, and it's doing amazing work, but it's not exactly what I want it to do.
I had to push it hard to refactor and simplify, as the code it generated was often far more complicated than it needed to be.
To be honest though, most of the code it generated I would accept if I were reviewing another developer's work.
I think that's the way we need to look at it. It's a junior developer that will complete our tasks, not always in our preferred way, but at 10x the speed, and will frequently make mistakes that we need to point out in CR. It's not a tool that will do exactly what we would.
biker142541
My experience so far on Claude 3.7 has been over-engineered solutions that are brittle. Sometimes they work, but usually not precisely the way I prompted it to, and often attempts to modify them require more refactoring due to the unnecessary complexity.
This has been the case so far in both js for web (svelte, react) and python automation.
I feel like 3.5 generally came up "short" more often than 3.7, but in practical usage that meant I could more easily modify it and build on top of it. 3.7 has led to a lot of deconstructing, reprompting, and starting over.
jmull
All I really care about is the end result and, so far, LLMs are nice for code completion, but basically useless for anything else.
They write as much code as you want, and it often sorta works, but it's a bug-filled mess. It's painstaking work to fix everything, on par with writing it yourself. Now, you can just leave it as-is, but what's the use of releasing software that crappy?
I suppose it's a revolution for that in-house crapware company IT groups create and foist on everyone who works there. But the software isn't better, it just takes a day rather than 6 months (or 2 years or 5 years) to create. Come to think of it, it may not be useful for that either… I think the end purpose is probably some kind of brag for the IT manager/exec, and once people realize how little effort is involved it won't serve that purpose.
barbazoo
I love the subtle mistakes that get introduced in strings, for example, that then take all the time I saved to fix.
fallinditch
Have you tried using Cursor rules? [1]
Creating a standard library ("stdlib") of rules, potentially thousands of them, and then iteratively adding to and amending them as you go, is one of the best practices for successful AI coding.
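For anyone who hasn't seen one, a rule is just a short standing instruction the editor injects into the model's context. Something like this, purely illustrative (the file layout and frontmatter vary by Cursor version):

    # .cursor/rules/error-handling.mdc (hypothetical example)
    ---
    description: Error handling conventions
    globs: src/**/*.ts
    ---
    - Never swallow exceptions; log and rethrow.
    - Return typed errors across module boundaries instead of throwing.

The point is that each correction you'd otherwise repeat in chat gets written down once and applied automatically.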
bikamonki
Right on point. The same principle applies when deciding whether to use a framework or not. Coders often marvel at the speed with which they can build something using a framework they don’t fully understand. However, a true programmer seeks to know and comprehend what’s happening under the hood.
InvertedRhodium
I'll preface this with the fact that I agree there is a difference between using a framework and being curious enough to dig into the details, but I think you're veering into No True Scotsman territory here.
IMO, the vast majority of programmers wouldn't meet the definition you've put forward here. I don't know many that dig into the operating system, platform, or hardware all that much, though I work in streaming media so that might just be an industry bias.
icedchai
This aligns with my experience. I've seen LLMs produce "code" that the person requesting is unable to understand or debug. It usually almost works. It's possible the person writing the prompt didn't actually understand the problem, so they got a half baked solution as a result. Either way, they need to go to a human with more experience to figure it out.
hsuduebc2
Tbh, if I do not understand generated code perfectly, meaning it uses something I don't quite know, I usually spend approximately the same time understanding the generated code as I would writing it myself.
beezlewax
I'm waiting for artisan programming to become a thing.
discordance
by 100% organic, free range and fair trade programmers
pydry
Artisanal code has been a thing for a long while.
If we're the luddite artisans, LLMs seem to represent the knitting frames which replaced their higher quality work with vastly cheaper, far crappier merchandise. There is a historical rhyme here.
LeftHandPlane
Artisanal firmware is the future (or the past? or both?): https://www.youtube.com/watch?v=vBXsRC64hw4
jdmoreira
From before people even knew what llms were: https://handmade.network
dr_dshiv
Like, writing binaries directly? Is assembly code too much of an abstraction?
someothherguyy
> If you think about code as a means to an end, and focus on the end
The problem with this is that you will never be able to modify the code in a meaningful way after it crosses a threshold, so either you'll have a prompt only modification ability, or you will just have to rewrite things from scratch.
I wrote my first application ever (equivalent to an education CMS today) in the very early 2000s with barely any notion of programming fundamentals. It was probably a couple hundred thousand lines of code by the time I abandoned it.
I wrote most of it in HTML, JS, ASP and SQL. I was in high school. I didn't know what common data structures were. I once asked a professor in late high school "why arrays are necessary in loops".
We called this cookbook coding back in the day.
I was pretty much laughed at when I finally showed people my code, even though it was a completely functional application. I would say an LLM probably can do better, but it really doesn't seem like something we should be chasing.
oxag3n
I tried LLMs for my postgraduate "programming" tasks, to create lower-level data structures and algorithms that it's possible to write detailed requirements for; they failed miserably. When I pushed in certain directions, I got student-level replies like "collision probability is so low we can just ignore it", while the same LLM accurately estimated that in my dataset there would be collisions.
And I won't believe it until I see it: LLMs using a real debugger to figure out the root cause of a sophisticated, cascading bug.
mrits
This "surrendering to the LLM" idea has been going around a lot lately. I can only guess it comes from people who haven't tried it much themselves but love to repeat other people's experiences.
BeetleB
Some hints for people stuck like this:
Consider using Aider. It's a great tool and cheaper to use than Claude Code.
Look at Aider's LLM leaderboard to figure out which LLMs to use.
Use its architect mode (although you can get quite fast without it - I personally haven't needed it).
Work incrementally.
I use at least 3 branches: my main one, a dev one, and a debug one. I develop on dev. When I encounter a bug, I switch to debug. The reason is that it can produce a lot of code to fix a bug: it will write some code to fix it, that won't work, it will try again and write even more code, and repeat until fixed. But in the end you only needed a small subset of the new code. So you then revert all the changes and have it fix it again, telling it the correct fix (see the sketch after these hints).
Don't debug on your dev branch.
Aider's auto committing is scary but really handy.
Limit your context to 25k.
Only add files that you think are necessary.
Combining the two: Don't have large files.
Add a Readme.md file. It will then update the file as it makes code changes. This can give you a glimpse of what it's trying to do and if it writes something problematic you know it's not properly understanding your goal.
Accept that it is not you and will write code differently from you. Think of it as a moderately experienced coder who is modifying the codebase. It's not going to follow all your conventions.
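The branch dance above, as a rough sketch (assuming current aider flags; adapt to your setup):

    git checkout dev           # normal feature work happens here
    git checkout -b debug      # bug found: give the LLM a sandbox to thrash in
    aider --model sonnet       # let it iterate; its auto-commits pile up on debug
    # Once the small real fix is visible among its attempts:
    git checkout dev           # discard the debug noise
    aider --model sonnet       # tell it the correct fix directly this time

Nothing from the thrashing ever lands on dev except the fix you dictate.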
majormajor
> I use at least 3 branches: my main one, a dev one, and a debug one. I develop on dev. When I encounter a bug, I switch to debug. The reason is that it can produce a lot of code to fix a bug: it will write some code to fix it, that won't work, it will try again and write even more code, and repeat until fixed. But in the end you only needed a small subset of the new code. So you then revert all the changes and have it fix it again, telling it the correct fix.
how big/complex does the codebase have to be for this to actually save you time, compared to just using a debugger and fixing it yourself directly? (I'm assuming here that bugs in smaller codebases are that much easier for a human to identify quickly.)
BeetleB
So far I've used Aider for only a few projects - almost all where it starts from scratch. And virtually always for personal use - not work. As such, the focus on quality is not as high (i.e. there's no downside to me letting it run wild).
So I hope you can understand it when I say: Why should I waste my brain cells debugging when I can just tell it to fix its own problems?
Say you want something done (for personal use), and don't have the time to develop it yourself. Someone volunteers to write it for you. You run it, and it fails. Would you spend much time reading code someone else wrote to find the bug, or just go back to the guy with the error?
Yes, I have had to debug a few things myself occasionally, but I do it only when it's clear that the LLM isn't able to solve it.
In other cases, I'm writing something in a domain I'm not fully knowledgeable (or using a library I'm only mildly familiar with). So I lack the knowledge to debug quickly. I would have to read the docs or Google. Why Google when the LLM is more effective at figuring it out? Certainly in a few cases, the solution turned out to require knowledge I did not have, and I appreciate that the LLM solved it (and I learned something as a result).
The point with all this is: the experience is not binary. It's a full spectrum. For the main codebase I'm responsible for at work, I haven't bothered using an LLM (and I have access to Copilot). I need to ensure the quality of the code, and I don't want to spend my time understanding the code the LLM wrote to the level I would need to feel comfortable pushing it to production.
geoka9
Thanks, that's a helpful set of hints!
Can you provide a ballpark of what kind of $ costs we are talking here for using Aider with, say, Claude? (or any other provider that you think is better at the moment).
Say a run-of-the-mill bug-fixing session from your experience vs the most expensive one off the top of your head?
BeetleB
I've used it only a few times - mostly for projects written from scratch (not existing codebases). And so far only with Claude Sonnet.
Twice I had a "production ready" throwaway script in under $2 (the first was under a dollar). Both involved some level of debugging. But I can't overstate how awesome it is to have a single use script be so polished (command line arguments, extensive logging, etc). If I didn't make it polished, it would probably have been $0.15.
Another one I wrote - I probably spent $5-8 total, because I actually had it do it 3 times from scratch. The first 2 times there were things I wasn't happy with, or the code got too littered with attempts to debug (and I was not using 3 branches). When I finally figured everything out, I started again for the third time and it was relatively quick to get something working.
Now if I did this daily, it's a tad expensive - $40-60/month. But I do this only once or twice a week - still cheaper than paying for Copilot. If I plan to use it more, I'd likely switch to DeepSeek. If you look at the LLM leaderboard (https://aider.chat/docs/leaderboards/), you'll see that R1 is not far behind Sonnet, and is a third of the cost.
yoyohello13
When I was using Aider with the Claude 3.5 API, it cost about $0.01 per action.
tptacek
The three-branch thing is so smart.
BeetleB
It took a while for me to realize it, and frankly, it's kind of embarrassing that I didn't think of it immediately.
It is, after all, what many of us would do in our manual SW development. But when using an LLM that seems pretty good, we just assume we don't need to follow all the usual good practices.
ddanieltan
do you have a special prompt to instruct aider to log file changes in the repo's README? I've used aider in repos with a README.md but it has not done this update. (Granted, I've never /add-ed the readme into aider's context window before either...)
Ey7NFZ3P0nzAe
Take a look at conventions.md in the aider.chat documentation.
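Concretely, aider can load a conventions file read-only so it rides along with every request; something like this (file contents invented for illustration):

    cat > CONVENTIONS.md <<'EOF'
    - Keep README.md in sync with any behavior change.
    - Prefer small, pure functions; add a unit test with every bug fix.
    EOF
    aider --read CONVENTIONS.md

That standing instruction is likely what produces the README updates described above.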
branko_d
I have the same experience.
Where AI shines for me is as a form of semantic search engine, or even a tutor of sorts. I can ask for the information that I need in a relatively complex way, and more often than not it will give me a decent summary and a list of "directions" to follow up on. If anything, it'll give me proper technical terms that I can feed into a traditional search engine for more info. But that's never the end of my investigation, and I always try to confirm the information it gives me by consulting other sources.
mentalgear
Exactly the same experience: since the early-access GPT-3 days, I've played out various scenarios, and the most useful case has always been to use generative AI as semantic search. Its generative features are just lacking in quality (for anything other than a toy project), and the main issue since the early GPT days remains: even though it gets better, it's still too unreliable for serious work on mid-complex systems. Also, if you don't pay attention, it messes up other parts of the code.
jofzar
Yeah, I have had some "magic" moments where I knew "what" I needed and had an idea of "how it would look", but no idea how to do it, and AI helped me understand how I should do it instead of the hacky, very stupid way I would have done it.
Yoric
Same here. In some cases, brainstorming even kinda works – I mean, it usually gives very bad responses, but it serves as a good duck.
Code? Nope.
smallerfish
I've done code interviews with hundreds of candidates recently. The difference between those who are using LLMs effectively and those who are not is stark. I honestly think engineers who think like OP are going to get left behind. Take a weekend to work on getting your head around this by building a personal project (or learning a new language).
A few things to note:
a) Use the "Projects" feature in Claude web. The context makes a significant amount of difference in the output. Curate what it has in the context; prune out old versions of files and replace them. This is annoying UX, yes, but it'll give you results.
b) Use the project prompt to customize the response. E.g. I usually tell it not to give me redundant code that I already have. (Claude can otherwise be overly helpful and go on long riffs spitting out related code, quickly burning through your usage credits).
c) If the initial result doesn't work, give it feedback and tell it what's broken (build messages, descriptions of behavior, etc).
d) It's not perfect. Don't give up if you don't get perfection.
triyambakam
Hundreds of candidates? That's significant if not an exaggeration. What are the stark differences you have seen? Did you inquire about the candidate's use of language models?
smallerfish
Yes. I do async video interviews in round 1 of my interview process in order to narrow the candidate funnel. Candidates get a question at the start of the interview, with a series of things to work through in their own IDE while sharing their screen. I review all recordings (though I will skip around, and if candidates don't get very far I won't spend a lot of time watching at 1x speed.) The question as laid out encourages them to use all of the tools they usually rely on while coding (including google, stackoverflow, LLMs, ...).
Candidates who use LLMs generally get through 4 or 5 steps in the interview question. Candidates who don't are usually still on step 2 by the end of the interview (with rare exceptions), without their code quality being significantly better.
(I end up in 1:1 interviews with perhaps 10-15% of candidates who take round 1).
nsonha
If it's real, that person interviewed at least one candidate per day last year. Idk what kind of engineering role in what kind of org has you doing that.
jacobedawson
I'd add to that that the best results come from clear spec sheets, which you can create using Claude (web) or another model like ChatGPT or Grok. Telling them what you want and what tech you're using helps them create a technical description with clear segments and objectives, and in my experience this works wonders in getting Claude Code, where it has full access to the entire context of your code base, on the right track.
cheema33
> The difference between those who are using LLMs effectively and those who are not is stark.
Same here. Most candidates I interviewed said they did not use AI for development work. And it showed. These guys were not well informed on modern tooling and frameworks. Many of them seemed stuck in/comfortable with their old way of doing things and resistant to learning anything new.
I even hired a couple of them, thinking that they could probably pick up these skills. That did not happen. I learned my lesson.
pm215
Isn't that more correlation than causation, though? The kind of person who's not keeping up with the current new tech hotness isn't going to be looking at AI or modern frameworks; and conversely the kind of person who's dabbling with AI is also likely to be looking at other leading-edge tech stuff in their field. That seems to me more likely to be the cause of what you're seeing than that the act of using AI/LLMs itself resulting in candidates improving their knowledge and framework awareness.
InvertedRhodium
My workflow for that kind of thing goes something like this (I use Sonnet 3.7 Thinking in Cursor):
1. The 1st prompt is me describing what I want to build, what I know I want, and any requirements or restrictions I'm aware of, ending with: "Based on these requirements, ask a series of questions to produce a complete specification document."
2. Workshop the specification back and forth until I feel it's complete enough.
3. Ask the agent to implement the specification we came up with.
4. Tell the agent to implement Cursor Rules based on the specifications to ensure consistent implementation details in future LLM sessions.
I'd say it's pretty good 80% of the time. You definitely still need to understand the problem domain and be able to validate the work that's been produced, but assuming you had some architectural guidelines, you should be able to follow the code easily.
The Cursor Rules step makes all the difference in my experience. I picked most of this workflow up from here: https://ghuntley.com/stdlib/
Edit: A very helpful rule is to tell Cursor to always check out a new branch, based on the latest HEAD of master/main, for all of its work.
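That last rule might look something like this (hypothetical contents; the rule-file format varies by Cursor version):

    # .cursor/rules/branching.mdc (hypothetical)
    ---
    description: Branch hygiene for agent sessions
    alwaysApply: true
    ---
    Before changing any code, create a fresh branch from the latest main:
    `git fetch origin && git checkout -b agent/<task-name> origin/main`.
    Never commit directly to main.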
theshrike79
I need to steal the specification idea.
Cursor w/ Claude has a habit of running away on tangents instead of solving just the one problem, then I need to reject its changes and even roll back to a previous version.
With a proper specification as guideline it might stay on track a bit better.
prettyblocks
Copilot supports this somewhat natively:
https://docs.github.com/en/copilot/customizing-copilot/addin...
The first thing I do for a new project is ask Copilot to create a custom-instructions.md for me and then as I work on my projects, I ask it to update the instructions every now and then based on the current state of my project.
Far fewer misses this way, in my experience.
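For what it's worth, GitHub's documented home for these is a markdown file checked into the repo; contents invented for illustration:

    # .github/copilot-instructions.md
    This is a TypeScript project.
    - Route all HTTP calls through the shared fetch wrapper.
    - Prefer named exports; no default exports.
    - When project conventions change, update this file in the same PR.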
slooonz
I decided to seriously try Sonnet 3.7. I started with a simple prompt on claude.ai ("Do you know claude code? Can you do a simple implementation for me?"). After minimal tweaking from me, it gave me this: https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016...
After interacting with this tool, I decided it would be nice if the tool could edit itself, so I asked it (him? it?) to create its next version. It came up with a non-working version of this: https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016.... I fixed the bug manually, but that started an interactive loop: I could now describe what I wanted, describe the bugs, and the tool would add the features/fix the bugs itself.
I then decided to rewrite it in TypeScript (by that I mean: "can you rewrite yourself in TypeScript"). And then add other tools (by that: "create tools and unit tests for the tools"). https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016... and https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016... were created by the tool itself, without any manual fix from me. Setting up the testing/mock framework? Done by the tool itself too.
In one day (and $20), I had essentially recreated claude-code, and could improve it just by asking "Please add feature XXX". $2 a feature, with unit tests, on average.
WD-42
So you're telling me you spent 20 dollars and an entire day for 200 lines of JavaScript and 75 lines of Python, and this to you constitutes a working re-creation of Claude Code?
This is why expectations are all out of whack.
BeetleB
That amount of output is comparable to what many professional engineers produce in a given day, and they are a lot more expensive.
Keep in mind this is the commenter's first attempt. And I'm surprised he paid so much.
Using Aider and Sonnet, I've on multiple occasions produced 100+ lines of code in 1-2 hours for under $2. Most of that time is hunting down the one bug it couldn't fix by itself (reflective of real-world programming experience).
There were many other bugs, but I would just point out the failures I was seeing and it would fix them itself. For particularly difficult bugs it would at times even produce a whole new script just to aid with debugging. I would run it and it would spit out diagnostics, which I fed back into the chat.
The code was decent quality - better than what some of my colleagues write.
I could probably have it be even more productive if I didn't insist on reading the code it produced.
slooonz
2200 lines, half of them unit tests I would probably have been too lazy to write myself even for a "more real" project. Yes, I consider $20 cheap for that, considering:
1. It's a learning experience.
2. Looking at the chat transcripts, many of those dollars are burned for stupid reasons (Claude often fails with the insertLines/replaceLines functions and breaks files due to off-by-one offsets) that are probably fixable.
3. Remember that Claude started from a really rudimentary base with few tools; the bootstrapping was especially inefficient.
Next experiment will be on an existing codebase, but that’s probably for next weekend.
Silhouette
Thanks for writing up your experience and sharing the real code. It is fascinating to see how close these tools can now get to producing useful, working software by themselves.
That said - I'm wary of reading too much into results at this scale. There isn't enough code in such a simple application to need anything more sophisticated than churning out a few lines of boilerplate that produce the correct result.
It probably won't be practical for the current state of the art in code generators to write large-scale production applications for a while anyway just because of the amount of CPU time and RAM they'd need. But assuming we solve the performance issues one way or another eventually it will be interesting to see whether the same kind of code generators can cope with managing projects at larger scales where usually the hard problems have little to do with efficiently churning out boilerplate code.
NitpickLawyer
aider has this great visualisation of "self written code" - https://aider.chat/HISTORY.html
matt_heimer
LLMs are replacing Google for me when coding. When I want to get something implemented, let's say make a REST request in Java using a specific client library, I previously used Google to find examples of using that library.
Google has gotten worse (or the internet has more garbage), so finding a code example is more difficult than it used to be. Now I ask an LLM for an example. Sometimes I have to ask for a refinement, and usually something is broken in the example, but it takes less time to get the LLM-produced example to work than it does to find a functional example using Google.
But the LLM has only replaced my previous Google usage, I didn't expect Google to develop my applications and I don't with LLMs.
ptmcc
This has been my experience of successful usage as well. It's not writing code for me, but pulling together the equivalent of a Stack Overflow example and some explaining sentences that I can follow up on. Not perfect and I don't blindly copy paste it, same as Stack Overflow ever was, but faster and more interactive. It's helpful for wayfinding, but not producing the end result.
deergomoo
I used the Kagi free trial when I was doing Advent of Code in a somewhat unfamiliar language (Swift) last year, as well as ChatGPT occasionally.
The LLM was obviously much faster and the information was much higher density, but in my limited experiment it had quite literally about a 20% rate of just making up APIs. But I was very impressed with Kagi's results and ended up signing up, now using it as my primary search engine.
heisenbit
It is really a double-edged sword. Some APIs I would not have found myself. In some ways an AI works like my mind's fuzzy associative memory fragments: there should be an option for this command to do X, because similar commands have this option and it would be possible to provide it. But in reality the library is less than perfectly engineered and the option is not there. The AI guesses the option is there too. But I do not need a guess when I ask the AI, I need reliable facts. If the cost of an error is not high, I still ask the AI and, if it fails, it's back to RTFM; but if the cost of failure is high, then everything that comes out of an LLM needs checking.
theshrike79
I did the Kagi trial in the fall of 2023 and tried to hobble along with the cheapest tier.
Then I got hooked by having a search engine that actually finds the stuff I need, and I've been a subscriber for a bit over a year now.
Wouldn't go back to Google lightly.
layer8
In order to use a library, I need to (this is my opinion) be able to reason about the library’s behavior, based on a specification of its interface contract. The LLM may help with coming up with suitable code, but verifying that the application logic is correct with respect to the library’s documented interface contract is still necessary. It’s therefore still a requirement to read and understand the library’s documentation. For example, for the case of a REST client, you need to understand how the possible failure modes of the HTTP protocol and REST API are translated by the library.
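A concrete instance of that contract question, using the browser's fetch (a sketch with a hypothetical endpoint): a network failure rejects the promise, while an HTTP-level failure resolves normally, and only the documentation tells you which is which.

    // Distinguishing the two documented failure modes of fetch:
    async function getUser(id: string): Promise<unknown> {
      let res: Response;
      try {
        res = await fetch(`https://api.example.com/users/${id}`);
      } catch (err) {
        // Only network/protocol failures land here: DNS, TLS, connection reset...
        throw new Error(`network failure: ${err}`);
      }
      // An HTTP 404 or 500 does NOT throw; the promise resolved "successfully",
      // so status handling is on you. That's the interface contract at work.
      if (!res.ok) {
        throw new Error(`REST API error: HTTP ${res.status}`);
      }
      return res.json();
    }

An LLM can type that pattern out, but verifying it matches the library's actual contract is exactly the reading described above.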
jayd16
I wonder how good Google could be if they had a charge per query model that these LLMs do. AI or not, dropping the ad incentive would be nice.
escapecharacter
I've found AI to be useful on precisely-scoped tasks I might assign to a junior programmer to take a day to do, like "convert this exact bash script to a Powershell script".
But in my own work, those tasks are pretty rare, like 3 times a month? Often I start working on something, and the scope and definition of success changes while I'm in the midst of it. Or it turns out to be harder than expected and it makes sense to timebox it and do a quick search for workarounds.
As much as we joke about StackOverflow commenters sometimes telling a question-asker they shouldn't be doing what they're trying to do, you do actually want that (soft) pushback some of the time. Most modern LLMs will gleefully come up with a complete plan for how we're gonna get the square peg in the round hole.
philipswood
> you do actually want that (soft) pushback some of the time. Most modern LLMs will gleefully come up with a complete plan for how we're gonna get the square peg in the round hole.
I once accidentally asked a local DeepSeek distilled model (deepseek-r1:7b) to do the wrong thing by copy-pasting the wrong variable name. It saw me trying to do something stupid (I was working with the wrong variable), told me how to do what I asked, and then asked:
> Is this modification part of a larger change you're making to the code? I'd like to make sure we're not modifying something important that might have side effects.
Looking at its thought process:
> The user wants to modify this string by replacing "-input" with "-kb". But looking at the ARN structure, I recall that AWS S3 ARNs are usually in the form arn:aws:s3:::bucket_name RegionalPart path. The part after the bucket name is typically s3:// followed by the object key.
> Wait, maybe the user has a specific reason to make this change. Perhaps they're formatting or structuring something for a different purpose. They might not realize that directly modifying ARNs can cause issues elsewhere if not done correctly.
escapecharacter
That's nice!
CGamesPlay
I decided to check this out after seeing the discussion here. I had previously misunderstood that it required a Claude.ai plan, but it actually just uses your API keys.
I did a comparison between Claude Code and Aider (my normal go-to): I asked each to clone a minor feature in my existing app with some minor modifications (specifically, a new global keyboard shortcut in a Swift app).
Claude Code spent about 60 seconds and $0.73 to search the code base and make a +51 line diff. After it finished, I was quite impressed by its results; it did exactly the correct set of changes I would have done.
Now, this is a higher level of task than I would normally give to Aider (because I didn't provide any file names, and it requires changing multiple files), so I was not surprised that Aider completely missed the files it needed to modify to start (asking me to add 1 correct file and 2 incorrect files). I did a second attempt after manually adding the correct files. After doing this, it produced an equivalent diff to Claude Code. Aider did this in 1 LLM prompt, or about 15 seconds, with a total cost of $0.07, about 10% of the Claude Code cost.
Overall, it seems clear that the higher level of autonomy carries a higher cost with it. My project here was 7k SLOC; I would worry about ballooning costs on much larger projects.
bufferoverflow
Now, the billion dollar question is - how long would it take you to code that diff?
CGamesPlay
Probably about 3 minutes? That's my main usage of these types of coding tools, honestly. I already know generally what I want to happen and validating that the LLM is on the right track / reached the right solution is easy.
I'm not saying that "Claude Code makes me 300% more effective", but I guess it did for this (simple) task.
generic92034
Are you comparing only the generation time with your coding time? How are the figures if you include the necessary step of checking the generated code? And how do these times change if you are coding in a very complex environment?
off_by_inf
And how long would it take you to debug AI generated code again and again?
ddacunha
I recently made some changes to a website generated by a non-technical user using Dreamweaver. The initial state was quite poor, with inline CSS and what appeared to be haphazard copy-pasting within a WYSIWYG editor.
Although I'm not proficient in HTML, CSS, or JavaScript, I have some understanding of what good code looks like. Through several iterations, I managed to complete the task in a single evening; it would have taken me a week or two to relearn and apply the necessary skills. Not only is the code better organised, it's half the size and the website looks better.
fragmede
Time spent is not the only question. How much thought it takes, however impossible that may be to measure, is another one. If an LLM-assisted programmer is able to solve the problem without deep focus, while responding to emails and attending meetings, vs the programmer who can't, is time really the only metric we can have here?
ignoramous
> how long would it take you to code that diff?
My scrum/agile coach says, by parallelizing prompts, a single developer can babysit multiple changes in the same time slice. By having a sequence of prompts ready before hand, a single developer can pipeline those one after the other. With an IDE that helps schedule such work, a single developer can effectively hyper-thread their developmental workflow. If the developer is epoll'ing at 10x the hertz... that's another force multiplier. Of course context switches & side-channels are of concern, but a voice over my shoulder tells me that as long as memory safety is guaranteed, everything should turn up alrigd3adb33f.
throwawaymsft
Infinite, because the median counterfactual is never getting around to this P4 “nice to have” issue that’s languished in the backlog.
artdigital
Same here. I did a few small tasks with Claude Code after seeing this discussion here, and it is too expensive for me.
A small change to create a script file (20 LoC) was 10 cents; a quick edit to a README was 7 cents.
Yes, yes, engineers make more than that, blah blah, but the cost would quickly jump out of control for bigger tasks. I'd easily burn through upwards of $10-20 a day with this, or $100-$300 a month. Unless you have a Silicon Valley salary, that's too expensive.
I use other tools like Cody (the tool the author created) or Copilot because I pay $10 a month and that’s it. Yes I get rate limited almost daily but I don’t need to worry that my tool cost is going out of control suddenly.
I hope Anthropic introduces a new plan that bundles Claude Code into it, I’d be much more comfortable using that knowing it won’t suddenly be more than my $50/mo (or whatever)
stevage
It's an interesting question. As a freelance consultant, theoretically a tool like this could allow me to massively scale up my income, assuming I could find enough clients.
I'm a bit nervous where I'd end up though - with code I'd "written" but wasn't familiar with, and with who knows what kinds of limitations or subtle bugs baked in.
firtoz
I currently pay around $200-300 to a combination of Cursor + Anthropic through the API. I have both a full time job and freelance work. It pays for itself. I end up reviewing more than manual coding, to ensure the quality of the results. Funnily, the work I did through this method has received more praise than my usual work.
leoedin
> I'm a bit nervous where I'd end up though - with code I'd "written" but wasn't familiar with
This does seem like quite a big downside. It turns every new feature into “implement this in someone else’s code base”. I imagine you’d very quickly have complete dependency on the AI. Maybe that’s an inevitability in this new world?
bayarearefugee
> Yes yes engineers make more than that blah blah but the cost would quickly jump out of control for bigger tasks.
Also, (most) engineers don't hallucinate answers. Claude still does, regularly. When it does it in chat mode via a flat-rate Pro plan, I can laugh it off and modify the prompt to give it the context it clearly didn't understand, but if it's costing me very real money for the LLM to over-eagerly over-engineer an incorrect implementation of the stated feature, it's a lot less funny.
artdigital
Exactly! Especially with agentic tools like Aider and Claude Code that are designed to pull more files into their context automatically, based on what the LLM thinks it should read. That can very quickly go out of control and result in huge context windows.
Right now with Copilot or other fixed subscriptions I can also laugh it off and just create a new tab with fresh context. Or if I get rate-limited because of too much token use I can wait 1 day. But if these actions are linked to directly costing money on my card, then that's becoming a lot more scary.
naasking
> Also (most) engineers don't hallucinate answers.
They absolutely do, where do you think bugs come from? The base rate is typically just lower than current AIs.
zo1
I use Grok and it's free (even Grok3). I definitely don't hit limits unless it's a pretty heavy day and I do a lot of adjustments. Also, don't send entire codebases to it, just one-off files. What's quite amazing is how it doesn't matter that it doesn't have the source to dependent files, it figures it out and infers what each method does based on its name and context, frigging amazing if you ask me.
And it doesn't fight me like the OpenAI tooling does that logs me out randomly every day and I have to login and spend 4 minutes copying login codes from my email or answering their stupid Captcha test. And this is on their API playground where I pay for every single call - so not like I'm trying to scrape my free chat usage as an API.
artdigital
> I use Grok and it's free (even Grok3). I definitely don't hit limits unless it's a pretty heavy day and I do a lot of adjustments
Okay, maybe I need to clarify: I hit those limits when I do agentic stuff, which is what Claude Code does: let the LLM automatically pull files into the context it thinks it needs, analyze my codebase, follow imports, add more code, etc. It can quickly balloon out of control when the LLM pulls in too many LoC and the context window gets too big.
Then do a few back and forth actions like "let's refine this plan, instead of X pls do Y", or "hmm I think maybe we should also look into file blah.ts" and you quickly hit 500k tokens.
If I use Cody only, which has some agentic capabilities but is much more "how can I implement Y in this file @src/file1.ts db models are in @src/models/foo.ts", then I rarely ever hit any rate limitations. That's more similar to what you describe of copying code back and forth, except it's in the editor and you can do it by writing @somefile.
immibis
You send your company's trade secrets directly to Elon Musk?
lolinder
I think that tools like this have to operate on a subscription model like Cursor does in order to make any kind of sense for most users. The pay as you go model for agentic code tools makes you responsible for paying for:
* Whatever context the agent decides to pull in.
* However many iterations the model decides to run.
* Any result you get, regardless of how bad it is.
With pay as you go, the tool developer has no incentive to minimize any of these costs—they get paid more if it's inefficient, as long as it's not so inefficient that no one uses it. They don't even need it to be especially popular, they just need some subset of the population to decide that costs don't matter (i.e. those with Silicon Valley salaries).
With Cursor's model of slow and fast queries, they are taking responsibility for ensuring that the agents are as cost efficient as possible. The more efficient the agent the larger their cut. The fewer times that people have to ask a question a second time, the larger their cut. This can incentivize cutting corners, but that somewhat balanced out by the need to keep people renewing their subscription, and on the whole for most users it's better to have a flat subscription price and a company that's optimizing their costs than to have a pay-as-you-go model and a company that has no incentive to improve efficiency.
foz
I think this core business model question is happening at all levels in these companies. Each time the model goes in the wrong direction, and I stop it - or I need to go back and reset context and try again - I get charged. The thing is, this is actually a useful and productive way to work sometimes. Like when pairing with another developer, you need to be able to interrupt each other, or even try something and fail.
I don't mind paying per-request, but I can't help but think the daily API revenue graph at Cursor is going up whenever they have released a change that trips up development and forces users to intervene or retry things. And of course that's balanced by churn if users get frustrated and stop or leave. But no product team wants to have that situation.
In the end I think developers will want to pay a fair and predictable price for a product that does a good job most of the time. I don't personally care about switching models, I tend to gravitate towards the one that works best for me. Eventually, I think most coding models will soon be good at most things and the prices will go down. Where will that leave the tool vendors?
cft
Unfortunately the opposite is happening: Cursor is moving to a pay-per-use model:
https://x.com/ericzakariasson/status/1898753771754434761
https://old.reddit.com/r/cursor/comments/1j5kvun/cursor_0470...
I am afraid that the endgame of programming will be whoever has the biggest budget for an LLM, further consolidating programming into megacorps and raising the barrier to entry.
personjerry
That seems like a steal? Engineers are paid much more to do much less
CGamesPlay
No, I'm paid much more to do much more than what I did in this simple task. Claude didn't even test the changes (in this case, it does not have the hardware required to do that), or decide that the feature needed to be implemented in the first place. But my comparison wasn't "how do I compare to Claude Code", it was "how does Aider compare to Claude Code". My boss does not use Aider or Claude Code, and would not be happy with the results of replacing me with it (yet).
Sterling9x
[flagged]
y1n0
I think you missed the point. The claim is that it would have cost you more to hire a person to do what either claude code or aider did.
beepbooptheory
I know this is not really in the spirit of the room here, but before I ever dreamed of getting paid to code, I only learned to at all because I was a cheap and/or poor cook/grad student who wanted to make little artsy musical things on the computer. I remember the first time I just downloaded Pure Data. No torrent, no cracks, it was just there for me, and all it asked for was my patience.
The only reason I ever got into linux at all was because I ended up with some dinky acer chromebook for school but didn't want to stop making stuff. Crouton changed my life in a small way with that.
As I branched out and got more serious, learning web development, emacs, java, I never stopped feeling so irrationally lucky that it was all free, and always would be. Coming on here and other places to keep learning. It is to this day still the lovely forever hole I can excavate that costs only my sleep and electricity.
This is all not gone, but if I was just starting now, I'd find HN and SO and coding Twitter just like I did 10 years ago, but would be immediately turned off by this pervasive sense that "the way to do things now" is seemingly inseparable from a credit card number and a monthly charge, however small. I just probably would not have gotten into it. It just wouldn't feel like it's for me: "oh well, I don't really know how to do this anyway, I can't justify spending money on it!" $0.76 for 50 LoC is definitely nuts, but even $0.10 would have turned me way off. I had the same thoughts with all the web3 stuff too...
I know this speaks more to my money eccentricities than anything, and I know we don't really care on here about organic weirdo self-teachers anymore (just productivity, I guess). I am truly not even bemoaning the present situation, everyone has different priorities, and I am sure people are still having the exciting discovery of the computer like I did, on their Cursor IDE or whatever. But I am personally just so, so grateful the timeline lined up for me. I don't know if I'd have my passion for this stuff if I was born 10 years later than I was, or otherwise started learning now. But I guess we don't need the passion anymore anyway, it's all been vectorized!
cwalv
> But I am personally just so so grateful the timeline lined up for me.
I know the feeling. We still have access to the engineering thought processes responsible for some of the most amazing software feats ever accomplished (thru source repo history and mailing lists), just with access to the Internet. Of course there's a wealth of info available for free on the web for basically any profession, but for software engineering in particular it's almost direct access to world class teams/projects to learn from.
> but would be immediately turned off by this pervasive sense that "the way to do things now" is seemingly inseparable from a credit card number and monthly charge
To be effective you still need to understand and evaluate the quality of the output. There will always be a certain amount of time/effort required to get to that point (i.e., there's still no silver bullet).
> But I guess we don't need the passion anymore anyway, its all been vectorized!
We're not running out of things that can be improved. With or without these tools, the better you get, the more of the passion/energy that gets directed at higher levels of abstraction, i.e. thinking more about what to solve, tradeoffs in approaches, etc. instead of the minute details of specific solutions.
idiotsecant
This doesn't make much sense to me. Is there some reason a kid today can't still learn to code? In the contrary, you have LLMs available that can answer your exact personalized questions. It's like having a free tutor. It's easier than it's ever been to learn for free.
JKCalhoun
That sounds like the best way to get into coding. (For me it was wanting to realize game ideas to entertain myself.)
Money for a computer when I was getting into it was the credit-card part of it — there were no cheap Chromebooks then. (A student loan took care of the $1200 or so I needed for a Macintosh Plus.)
I suspect that's always the way of it though. There will be an easier way throwing money at a thing and there will be the "programming finds a way" way.
baq
It must be said, preferably in bold, that:
> this pervasive sense that "the way to do things now" is seemingly inseparable from a credit card number and monthly charge
…is true, but it only applies to experienced engineers who can sculpt the whole solution using these tools, not just random code. You need the whole learning effort to be able to ground the code the slop generators make. The passion absolutely helps here.
Note this is valid today. I have concerns that I’ll have different advice in 2027…
Bjorkbat
Tangential, but this reminds me of something someone said on Twitter that has resonated with me ever since: startups targeting developers / building developer tooling are arguably among the worst startups to build, because no matter how much of a steal the price is relative to the value you get, developers insist they can build their own or get by with an open-source competitor. We're as misguided on value as we are on efficiency and automation (more specifically, the old trope of a dev spending hours to automate something that takes minutes to do).
deepGem
This is also why devs are not in charge of purchase decisions at tech companies. I don't mean FAANG but the usual tech shops. Someone buys a tool and you have to use it. I think the startups selling dev tools are not selling to developers at all, but to the IT folks of these big tech firms.
Should they pull it off, it's not at all a bad startup to build. However, you now need to invest in a sales force that can sell to the Fortune 500. As a tech founder with no sales chops, this will be incredibly hard to pull off.
I digress, but yeah selling to devs is almost always a terrible idea since we all want to build our own stuff. That spirit may also be waning with the advent of Replit agent, Claude code and other tools.
cwalv
I've noticed this tendency in myself and thought about the 'why' a lot, and I think it comes down to subconsciously factoring in the cost of lock-in, or worse, lack of access to fix/improve a tool I've come to rely on.
bryanrasmussen
>We're as misguided on value as we are on efficiency and automation (more specifically, the old trope of a dev spending hours to automate something that takes minutes to do).
but automating something that takes minutes to do is Larry Wall's example of programmer laziness, and is a virtue.
of course - this needs obligatory conflicting XKCD comics
automation makes you do more work: https://xkcd.com/1319/
is it worth the time to automate https://xkcd.com/1205/
tropin
Not everybody works in the USA.
caseyf7
Which model did you use with aider?
CGamesPlay
My post above was with sonnet-3.5. When I used sonnet-3.7, it didn't speculate about the files at all; it simply requested that I add the appropriate ones.
maineagetter
[dead]
chaosprint
It seems the original poster hasn't extensively tried various AI coding assistants like Cursor or Windsurf.
Just a quick heads-up based on my recent experience with agent-based AI: while it's comfortable and efficient 90% of the time, the remaining 10% can lead to extremely painful debugging experiences.
In my view, the optimal scenarios for using LLM coding assistants are:
- Architectural discussions, effectively replacing traditional searches on Google.
- Clearly defined, small tasks within a single file.
The first scenario is highly strategic, the second is very tactical. Agents often fall awkwardly between these two extremes. Personally, I believe relying on an agent to manage multiple interconnected files is risky and counterproductive for development.
sitkack
Steve Yegge. You should know who he is, AND the post also mentioned Cursor and Windsurf. His own company works on a similar product.
hashmap
This has been my experience as well. I find that the copy/paste workflow with a browser LLM still gets me the most bang for the buck in both those cases. The cli agents seem to be a bit manic when they get hold of the codebase and I have a harder time corralling them into not making large architectural changes without talking through them first.
For the moment, after a few sessions of giving it a chance, I find myself using "claude commit" but not asking it to do much else outside the browser. I still find o1-pro to be the most powerful development partner. It is slow though.
tomnipotent
The author works on Cody at Sourcegraph so I'll give him the benefit of the doubt that he's tried all the major players in the game.
mvdtnz
He's also the author of one of the most legendary posts about programming language design of all time, Execution in the Kingdom of Nouns.
http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom...
finolex1
He literally says in his post "It might look antiquated but it makes Cursor, Windsurf, Augment and the rest of the lot (yeah, ours too, and Copilot, let's be honest) FEEL antiquated"
sbszllr
> In my view, the optimal scenarios for using LLM coding assistants are:
> - Architectural discussions, effectively replacing traditional searches on Google.
> - Clearly defined, small tasks within a single file.
I think you're on point here, and it has been my experience too. Also, not limited to coding but general use of LLMs.
bdangubic
duuuuude :) you should seriously consider deleting this post… if you do not know who Steve Yegge is (the original poster as you call him) you really should delete this post
intrasight
> extremely painful debugging experiences.
I'd claim that if you're debugging the code - or even looking at it for that matter - that you're using AI tools the wrong way.
chaosprint
I'd be very interested to know of a way to make it work with AI that doesn't require debugging if you can illustrate.
intrasight
Make what work with AI?
vunderba
Congratulations. You allow the AI to make some new subroutine, and you immediately commit and merge the changes to your system. You run it, and it executes without throwing any immediate errors.
The business domain is far more nuanced and complex, and your flimsy "does it compile" test for the logic doesn't even begin to cover the necessary gamut of the input domain which you as the expert might have otherwise noticed had you performed even a cursory analysis of the LLM generated code before blindly accepting it.
Nice to know that I'm going to be indefinitely employed fixing this kind of stuff for decades to come...
intrasight
>You allow the AI to make some new subroutine
Again, you're using AI the wrong way.
collingreen
This is exactly my impression of the summary of these kinds of posts and, I'm speculating here, maybe where there is such a stark difference.
I'm guessing that the folks who read the output and want to understand it deeply and want to "approve" it like a standard pull request are having a very different perspective and workflow than those who are just embracing the vibe.
I do not know if one leads to better outcomes than the other.
esafak
Are you serious? Why not just vibe work with your human coworkers and merge to master then? Let's see what the outcome is!
inciampati
I'm impressed by how many people who are working with Claude Code seem to have never heard of its open source inspiration, aider: https://aider.chat/
It's exactly what Yegge describes. It runs in the terminal, offering a retro vision of the most futuristic thing you can possibly be doing today. But that's been true since the command line was born.
But it's more than Claude Code, in that it's backend LLM agnostic. Although sonnet 3.7 with thinking _is_ killing the competition, you're not limited to it, and switching to another model or API provider is trivial, and something you might do many times a day even in your pursuit of code that vibes.
I've been a vim and emacs person. But now I'm neither. I never open an editor except in total desperation. I'm an aider person now. Just as overwhelmingly addicted to it as Yegge is to Claude Code. May the future tooling we use be open source and ever more powerful.
Another major benefit of aider is its deep integration with git. It's not "LLM plus coding" it's really like an interface between any LLM (or LLMs), git, and your computer. When things go wrong, I git branch, reset to a working commit, and continue exploring.
note: cross-posted from the other (flagged) thread on this
victorbjorklund
Aider is amazing. You can even use it in copy-paste mode with web-based AIs.
rs186
A single tweet with lots of analogies, and no screenshots, screen recordings, or code examples whatsoever. These are just words. Are we just discussing programming based on vibes?
delusional
It's influencer culture. It's like when people watch those "software developer" youtubers and pretend it's educational. It's reality television for computer people.
mpalmer
Reality television plus cooking show, exactly.
macNchz
Cooking shows are a perfect analogy for this stuff. For some reason I never connected the highly-edited-mass-appeal "watch someone do skilled work" videos on YouTube with Food Network style content until just now, but you're right they're totally scratching the same basic itch. They make people feel like they're learning something just by watching, while there is really no substitute for actually just doing the thing.
tylerrobinson
> reality television for computer people
Complete with computer people kayfabe!
frankc
I think the interest has more to do with who is doing the tweeting, don't you think?
rs186
Reminder: "appeal to authority" is a classical logical fallacy.
For me, I don't know this person, which means that all the words are completely meaningless.
Which is exactly the point.
frankc
I don't think it's really an appeal to authority. No one is saying it must be true because this guy says it's true. However, when a very respected figure in software engineering culture like Steve Yegge gives an opinion, that is more noteworthy than when random Joe Schmo gives his opinion. The fact that you don't know who he is means it's not interesting to you. Clearly, it's noteworthy to others.
JohnKemeny
Exception: Be very careful not to confuse "deferring to an authority on the issue" with the appeal to authority fallacy. Remember, a fallacy is an error in reasoning. Dismissing the council of legitimate experts and authorities turns good skepticism into denialism. The appeal to authority is a fallacy in argumentation, but deferring to an authority is a reliable heuristic that we all use virtually every day on issues of relatively little importance. There is always a chance that any authority can be wrong, that’s why the critical thinker accepts facts provisionally. It is not at all unreasonable (or an error in reasoning) to accept information as provisionally true by credible authorities. Of course, the reasonableness is moderated by the claim being made (i.e., how extraordinary, how important) and the authority (how credible, how relevant to the claim).
Quotation from https://www.logicallyfallacious.com/logicalfallacies/Appeal-...
baq
When Karpathy and Yegge speak, I listen. When they say approximately the same things, I listen carefully.
tymscar
I agree with you that the tweet is basically useless, but what they've done there is not an appeal-to-authority fallacy. They've only explained why it's popular. Arguably, your response is a genetic fallacy.
theonething
> For me, I don't know this person, which means that all the words are completely meaningless.
Because you don't know who Yegge is, the words are meaningless to you? So a body of text is meaningful only if you know who the author is? That's...lame.
mhh__
> vibe
People do literally call it vibe coding.
https://en.wikipedia.org/wiki/Vibe_coding (it turns out there is a Wikipedia page already, although I suspect it'll be gone soon)
Bjorkbat
I'm amused by all the flags this article has. It reinforces this belief that "vibe-coding" isn't something that evolved organically, but was forced. I wouldn't go as far as to call it "astroturfed", I believe it was a spontaneous emergence, but otherwise it feels like an effort by a disproportionately small group of people desperately trying to make vibe-coding a thing for whatever reason.
mhh__
it's definitely more organic than you think. I know smart productive people doing it as purely a 0 to 1 thing.
rglover
This is quite possibly one of the most disturbing things I've seen in a while.
Sure, for fun and one-off private apps this is fine, but the second that some buzzed-on-the-sides haircut guy thinks this is the way, the chaos will rival a Hollywood disaster movie.
leptons
>"The practice defies the belief in the software industry that software engineering demands great skill."
But actual software engineering does demand great skill. What "vibe coders" are doing isn't engineering. It's not even programming. It's like calling yourself a chef when you microwaved a frozen meal.
raylad
I tried it on a small Django app and was not impressed in the end.
It looks like it's doing a lot, and at first I was very impressed, but after a while I realized that when it ran into a problem it kept trying strategies that didn't work, even though it had tried them before and I had added instructions to claude.md telling it to keep track of strategies and not reuse failing ones.
It was able to make a little progress, but not get to the end of the task, and some of its suggestions were completely insane. At one point there was a database issue and it suggested switching to an entirely different database than the one already used by the app, which was working in production.
A couple of hours and $12 later, it had created 1,200 lines of partially working code and rather a mess. I ended up throwing away all the changes and going back to using the web UI.
babyent
Now take your $12 and multiply it by 100k people or more trying it.
Even if you won’t use it again, that’s booked revenue for the next fundraise!
bufferoverflow
That's revenue, not profit. GPUs aren't cheap.
nprateem
I use it like a brush for new apps and a scalpel for existing ones, and it generally works well. If it can't solve something after 3 attempts, though, I just do it myself.
winrid
LLMs seem to work a lot better with statically typed languages where the compiler can give feedback.
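For instance, here's a minimal sketch of the idea in Python, assuming mypy as the checker (the function names are made up):

    # An LLM-style slip that a type checker flags before any test runs.
    def average_latency(samples: list[float]) -> float:
        return [s / len(samples) for s in samples]  # bug: returns a list
        # mypy: Incompatible return value type
        #       (got "list[float]", expected "float")

    def average_latency_fixed(samples: list[float]) -> float:
        return sum(samples) / len(samples)  # passes the check

That mypy error can be fed straight back to the model as feedback; a dynamically typed codebase only surfaces it at runtime.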
jimbokun
LLMs are what’s finally going to make Haskell popular!
tome
And Haskell is going to make LLMs popular! https://groq.com/
phartenfeller
I tried it too and tasked it with a bigger migration (one web framework to another). It failed badly enough that I stopped the experiment. It still gave me a head start, where I can take parts and continue the migration manually. But the worst thing was that it did things I didn't ask for, like changing the HTML structure and CSS of pages and changing hand-picked hex color codes...
More about my experience on my blog: https://hartenfeller.dev/blog/testing-claude-code
ludamn
Such a nice read, thanks for sharing!
ing33k
I tried Claude Code and gave it very clear instructions to build a web-based tool I wanted to build over the weekend. It did exactly that! Sure, there were some minor modifications I had to make, but it completed over 80% of the work for me.
As for the app itself, it included a simple UI built on React with custom styling and real-time support using WSS. I provided it with my brand colors and asked it to use shadcn. It also includes a Node.js-based backend with socket.io and puppeteer. I even asked it to generate a Dockerfile and Kubernetes manifests. It almost did a perfect job—the only thing I had to fix manually was updating my Ingress to support WSS.
After studying the K8s manifests, I learned a bunch of new things as well. I spent around $6 on this session and felt it was worth it.
bob1029
I find that maintaining/developing code is not an ideal use case for LLMs and is distracting from the much more interesting ones.
Any LLM application that relies more or less on a single well-engineered prompt to get things done is entry-level and not all that impressive in the big picture - 99% of the heavy lifting is in the foundation model and next-token prediction. Many code assistants are based on something like this out of the necessity of supporting anybody's code. You can't rely on too many clever prompt-chaining patterns to build optimizations for Claude Code, because everyone takes a different approach to their codebase and has wildly differing expectations for how things should go down. Because the range of expectations is so vast, there is a lot of room to get disappointed.
The LLM applications that are most interesting have the model integrated directly with the product experience and rely on deep domain expertise to build sophisticated chaining of prompts, tool calling, and nesting of conversations. In these applications, the user's experience and outcomes are mostly predetermined, with the grey areas being what the LLM is left to handle. You can measure things and actually do something about them. What was the probability of calling one tool over the other in a specific context of use? Placing these prompts and statistics alongside domain requirements will let you see and make a difference.
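To make that concrete, here is a minimal, stubbed-out Python sketch of the shape I mean (the tool names and the fake model call are hypothetical, not any real API):

    from collections import Counter

    # The application predetermines the flow; the model only fills in the
    # grey area of choosing a tool for the user's request.
    TOOLS = {
        "lookup_order": lambda q: f"order status for {q!r}",
        "issue_refund": lambda q: f"refund started for {q!r}",
    }
    tool_counts = Counter()

    def pick_tool(query: str) -> str:
        # Stand-in for a real model call that returns a tool name.
        return "issue_refund" if "refund" in query else "lookup_order"

    def handle(query: str) -> str:
        tool = pick_tool(query)
        tool_counts[tool] += 1  # measurable: which tool, in which context
        return TOOLS[tool](query)

    handle("where is order 42?")
    handle("please refund order 42")
    print(tool_counts)  # Counter({'lookup_order': 1, 'issue_refund': 1})

Because the flow is constrained, you can place those counts alongside domain requirements and notice when the model drifts from expectations.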
hleszek
I must have been a little too ambitious with my first test with Claude Code.
I asked it to refactor a medium-sized Python project to remove duplicated code by using a dependency injection mechanism. That refactor is not really straightforward as it involves multiple files and it should be possible to use different files with different dependencies.
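(To give a sense of the shape of it, here's a toy version of the kind of dependency-injection refactor I mean, with made-up names, not the actual project:)

    # Before: each module hard-coded its own storage and duplicated the logic.
    # After: the shared logic lives in one place and the dependency is injected.
    class Exporter:
        def __init__(self, storage) -> None:
            self.storage = storage  # injected, not constructed inside

        def export(self, payload: str) -> None:
            self.storage.save(payload)

    class S3Storage:
        def save(self, payload: str) -> None:
            print(f"uploading {payload!r} to S3")

    class LocalStorage:
        def save(self, payload: str) -> None:
            print(f"writing {payload!r} to disk")

    # Different files can now wire in different dependencies:
    Exporter(S3Storage()).export("report")
    Exporter(LocalStorage()).export("report")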
Anyway, I explained the problem in a few lines and asked for a plan of what to do.
At first I was extremely impressed: it automatically used commands to read the files and gave me a plan of what to do. It seemed to perfectly understand the issue and even proposed some other changes that seemed like great ideas.
So I just asked it to proceed and make the changes, and it started to create folders and new files, edit files, and even run some tests.
I was dumbfounded; it seemed incredible. I did not expect it to work on the first try, as I already had some experience with AI making mistakes, but it seemed like magic.
Then once it was done, the tests (which covered 100% of the code) were not working anymore.
No problem: I isolated a few failing tests and asked Claude Code to fix them, and it did.
This went on a few times, finding failing tests and asking it to fix them, slowly trying to clean up the mess, until one test had a small problem: it succeeded (with pytest) but froze at the end of the test.
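(For what it's worth, a classic way to get exactly that symptom, sketched here with made-up code: the test itself passes, but a leftover non-daemon thread keeps the process alive afterwards.)

    import threading
    import time

    def poll_forever() -> None:
        while True:  # never exits
            time.sleep(1)

    def test_polling_starts() -> None:
        t = threading.Thread(target=poll_forever)  # daemon=False by default
        t.start()
        assert t.is_alive()  # the assertion passes...
        # ...but the interpreter hangs at exit, waiting on the thread.
        # Fix: daemon=True, or a worker with a shutdown condition.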
I asked Claude Code again to fix it, and it tried to add code to solve the issue, but nothing worked anymore. Each time it added some bullshit code, and each time it failed, adding more and more code to try to fix and understand the issue.
Finally, after $7.50 spent and 2000+ lines of code changed, it still wasn't working, and I didn't know why, since I hadn't made the changes myself.
As you know, it's easier to write code than to read code, so in the end I decided to scrap everything and make all the changes myself little by little, checking that the tests kept succeeding as I went along. I did follow some of the changes it had recommended, though.
Next time I'll start with something easier.
jpc0
Really, you nearly had the correct approach there.
I generally follow the same approach these days: ask it to develop a plan, then execute, but importantly have it execute each step in increments as small as possible and do a proper code review of each step. Ask it for the changes you want it to make.
There are certainly times I need to do it myself, but this has definitely improved my productivity to some degree.
It's just pretty tedious, so I generally write a lot of the "fun" code myself, and almost always do the POC myself, then have the AI do the "boring" stuff that I know how to do but really don't want to.
Same with docs: the modern reasoning models are very good at docs and, when guided to a decent style, can really produce good copy. Honestly, R1/4o are the first AIs I would actually consider pulling into my workflow, since they make fewer mistakes and help more than they harm. They still need to be babysat, though, as you noticed with Claude.
UncleEntity
> ...do all the changes myself little by little, checking that the tests keep succeeding as I go along.
Or... you can do that with the robots instead?
I tried that with the last generation of Claude, only adding new functionality when the previously added functionality was complete, and it did a very good job. Well, Claude for writing the code and Deepseek-R1 for debugging.
Then I tried a more involved project with apparently too many moving parts for the stupid robots to keep track of, and they failed miserably. Mostly Claude failed, since that's where the code was being produced; I can't really say whether Deepseek would've fared any better, because the usage limits didn't let me experiment as much.
Now that I have an idea of their limitations and have had them successfully shave a couple of yaks, I feel pretty confident getting them working on a project I've been wanting to do for a while.
darkerside
I'm curious about the follow-up post from Yegge, because this post is worthless without one. Great, Claude Code seems to be churning out bug fixes. Let's see if it actually passes tests, deploys, and works as expected in production for a few days, if not weeks, before we celebrate.
pchristensen
He posts a few times a year at https://sourcegraph.com/blog
elcomet
I'm wondering if you can prompt it to work like this - make minimal changes, and run the tests at each step to make sure the code is still working
espdev
This thing can "fix" tests, not code. It just adjusts the tests to match the incorrect code. So you need to keep an eye on the test code as well. That sounds crazy, of course. You have to constantly keep in mind that the LLM doesn't understand what it is doing.
biorach
git commit after each change it makes. It will eventually get itself into a mess. Revert to the last good state and tell it to try a different approach. Squash your commits at the end
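If you want to script the checkpointing, here's a minimal sketch (plain git underneath; the Python helpers are my own invention):

    import subprocess

    def git(*args: str) -> None:
        subprocess.run(["git", *args], check=True)

    def checkpoint(message: str) -> None:
        # Commit everything after each AI change so every state is recoverable.
        git("add", "-A")
        git("commit", "-m", f"wip: {message}")

    def rewind(ref: str = "HEAD~1") -> None:
        # Throw away the mess and return to the last good state.
        git("reset", "--hard", ref)

    # At the end, squash the wip commits, e.g.:
    #   git rebase -i <base-commit>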
noisy_boy
The trick is not to get sucked into making it do 100% of the task, and to have a sense of the sweet spot. Provide it proper details upfront along with the desired overall structure - that should settle in about 10-15 mins of back and forth. This must include tests, which you have to review manually - again, you will find issues and lose time (say about 30-45 mins). Cut your losses and close the loose ends of the test code. Now run the tests and start giving it discrete tasks to fix them. This is easily 20-40 mins. Now take over and go through the whole thing yourself, because this is where you will find more issues upon in-depth checking (the LLM has done most of what it could), and this is where you must understand the code you need to support.
iwasbirchyfirst
I went from copying and pasting with ChatGPT Canvas, to Claude Artifacts, to Cursor w Claude. I haven't explored Rules yet, but have been using the Notepads for much of this. Much of my time is spent managing git commits/reverts, and preparing to bring Claude up to speed after the next chat renewal.
AI coding is like having a partner who is both the smartest person in town, but also a functional alcoholic, who's really, really good at hiding it. LLMs act like they are working in a dark warehouse with a flashlight, and Alzheimer's.
They can get an amazing amount of functional code done in a few minutes, and then spend hours trying to fix one detail. My standard prompt begins with, "Don't guess, debug!" They have limited resources and will bs you if they can.
For longer projects, since every prompt is almost starting from scratch (they do have a limited buffer, which makes it easy to become complacent), if you get into repeated debugging sessions, it will start creating new functions instead of making existing functions work, and the code bloat is tremendous. Perhaps Rules work, but I've given up trying to get it to code in my style. I'm trying to have AI do all the coding so I can just be the "idea" guy ("vibe" coding), so I'm learning to let go and let it code in ways that I would hate to maintain. It's working from code examples that don't use my style, so I'm not going to keep fighting it on style (with some exceptions, like variable naming conventions).
internet_points
> AI coding is like having a partner who is both the smartest person in town, but also a functional alcoholic, who's really, really good at hiding it.
I'm stealing this :-) (And from now on I'll be imagining coding alongside Columbo. Claude Columbo.)
I must be the dumbest "prompt engineer" ever: each time I ask an AI to fix or, even worse, create something from scratch, it rarely returns the right answer, and when asked for a modification it struggles even more.
All the incredible performance and success stories always come from these Twitter posts. I do find value in asking for simple but tedious tasks, like a small refactor or generating commands, but this "AI takes the wheel" level does not feel real.