How I Program with Agents

292 comments

·June 9, 2025

zOneLetter

Maybe it's because I only code for my own tools, but I still don't understand the benefit of relying on someone/something else to write your code and then reading it, understand it, fixing it, etc. Although asking an LLM to extract and find the thing I'm looking for in an API Doc is super useful and time saving. To me, it's not even about how good these LLMs get in the future. I just don't like reading other people's code lol.

vmg12

Here are the cases where it helps me (I promise this isn't ai generated even though im using a list...)

- Formulaic code. It basically obviates the need for macros / code gen. The downside is that they are slower and you can't just update the macro and re-generate. The upside is it works for code that is slightly formulaic but has some slight differences across implementations that make macros impossible to use.

- Using apis I am familiar with but don't have memorized. It saves me the effort of doing the google search and scouring the docs. I use typed languages so if it hallucinates the type checker will catch it and I'll need to manually test and set up automated tests anyway so there are plenty of steps where I can catch it if it's doing something really wrong.

- Planning: I think this is actually a very under rated part of llms. If I need to make changes across 10+ files, it really helps to have the llm go through all the files and plan out the changes I'll need to make in a markdown doc. Sometimes the plan is good enough that with a few small tweaks I can tell the llm to just do it but even when it gets some things wrong it's useful for me to follow it partially while tweaking what it got wrong.

Edit: Also, one thing I really like about llm generated code is that it maintains the style / naming conventions of the code in the project. When I'm tired I often stop caring about that kind of thing.

xmprt

> Using apis I am familiar with but don't have memorized

I think you have to be careful here even with a typed language. For example, I generated some Go code recently which execed a shell command and got the output. The generated code used CombinedOutput which is easier to used but doesn't do proper error handling. Everything ran fine until I tested a few error cases and then realized the problem. In other times I asked the agent to write tests cases too and while it scaffolded code to handle error cases, it didn't actually write any tests cases to exercise that - so if you were only doing a cursory review, you would think it was properly tested when in reality it wasn't.

tptacek

You always have to be careful. But worth calling out that using CombinedOutput() like that is also a common flaw in human code.

mlinhares

The downside for formulaic code kinda makes the whole thing useless from my perspective, I can't imagining a case where that works.

Maybe a good case, that i've used a lot, is using "spreadsheet inputs" and teaching the LLM to produce test cases/code based on the spreadsheet data (that I received from elsewhere). The data doesn't change and the tests won't change either so the LLM definitely helps, but this isn't code i'll ever touch again.

dontlikeyoueith

> Maybe a good case, that i've used a lot, is using "spreadsheet inputs" and teaching the LLM to produce test cases/code based on the spreadsheet data (that I received from elsewhere)

This seems weird to me instead of just including the spreadsheet as a test fixture.

vmg12

There is a lot of formulaic code that llms get right 90% of the time that are impossible to build macros for. One example that I've had to deal with is language bridge code for an embedded scripting language. Every function I want available in the scripting environment requires what is essentially a boiler plate function to be written and I had to write a lot of them.

felipeerias

Planning is indeed a very underrated use case.

One of my most productive uses of LLMs was when designing a pipeline from server-side data to the user-facing UI that displays it.

I was able to define the JSON structure and content, the parsing, the internal representation, and the UI that the user sees, simultaneously. It was very powerful to tweak something at either end and see that change propagate forwards and backwards. I was able to hone in on a good solution much faster that it would have been the case otherwise.

j1436go

As a personal anecdote I've tried to create Shell scripts for the testing of a public HTTP API that had pretty good documentation and in both cases the requests did not work. In one case it even hallucinated an endpoint.

owl_vision

plus 1 for using agents for api refresher and discovery. i also use regular search to find possible alternatives and about 3-4 out of 10 normal search wins.

Discovering private api using an agent is super useful.

shitpostbot

[dead]

dataviz1000

I am beginning to love working like this. Plan a design for code. Explain to the LLM the steps to arrive to a solution. Work on reading, understanding, fixing, planing, ect. while the LLM is working on the next section of code. We are working in parallel.

Think of it like being a cook in a restaurant. The order comes in. The cook plans the steps to complete the task of preparing all the elements for a dish. The cook sears the steak and puts it in the broiler. The cook doesn't stop and wait for the steak to finish before continuing. Rather the cook works on other problems and tasks before returning to observe the steak. If the steak isn't finished the cook will return it to the broiler for more cooking. Otherwise the cook will finish the process of plating the steak with sides and garnishes.

The LLM is like the oven, a tool. Maybe grating cheese with a food processor is a better analogy. You could grate the cheese by hand or put the cheese into the food processor port in order to clean up, grab other items from the refrigerator, plan the steps for the next food item to prepare. This is the better analogy because grating cheese could be done by hand and maybe does have a better quality but if it is going into a sauce the grain quality doesn't matter so several minutes are saved by using a food processor which frees up the cook's time while working.

Professional cooks multitask using tools in parallel. Maybe coding will move away from being a linear task writing one line of code at a time.

collingreen

I like your take and the metaphors are good at helping demonstrate by example.

One caveat I wonder about is how this kind of constant context switching combines with the need to think deeply (and defensively with non humans). My gut says I'd struggle at also being the brain at the end of the day instead of just the director/conductor.

I've actively paired with multiple people at once before because of a time crunch (and with a really solid team). It was, to this day, the most fun AND productive "I" have ever been and what you're pitching aligns somewhat with that. HOWEVER, the two people who were driving the keyboards were substantially better engineers than me (and faster thinkers) so the burden of "is this right" was not on me in the way it is when using LLMs.

I don't have any answers here - I see the vision you're pitching and it's a very very powerful one I hope is or becomes possible for me without it just becoming a way to burn out faster by being responsible for the deep understanding without the time to grok it.

dataviz1000

> I've actively paired with multiple people at once

That was my favorite part of being a professional cook, working closely on a team.

Humans are social animals who haven't -- including how our brains are wired -- changed much physiologically in the past 25,000 years. Smart people today are not much smarter than smart people in Greece 3,000 years ago, except for the sample size of 8B people being larger. We are wired to work in groups like hunters taking down a wooly mammoth.[0]

[0] https://sc.edu/uofsc/images/feature_story_images/2023/featur...

KronisLV

> I still don't understand the benefit of relying on someone/something else to write your code and then reading it, understand it, fixing it, etc.

Friction.

A lot of people are bad at getting started (like writer's block, just with code), whereas if you're given a solution for a problem, then you can tweak it, refactor it and alter it in other ways for your needs, without getting too caught up in your head about how to write the thing in the first place. Same with how many of my colleagues have expressed that getting started on a new project from 0 is difficult, because you also need to setup the toolchain and bootstrap a whole app/service/project, very similar to also introducing a new abstraction/mechanism in an existing codebase.

Plus, with LLMs being able to process a lot of data quickly, assuming you have enough context size and money/resources to use that, it can run through your codebase in more detail and notice things that you might now, like: "Oh hey, there are already two audit mechanisms in the codebase in classes Foo and Bar, we might extract the common logic and..." that you'd miss on your own.

divan

On one codebase I work with, there are often tasks that involve changing multiple files in a relatively predictable way. Like there is little creativity/challenge, but a lot of typing in multiple parts/files. Tasks like these used to take 3-4 hours complete before just because I had to physically open all these files, find right places to modify, type the code etc. With AI agent I just describe the task, and it does the job 99% correct, reducing the time from 3-4 hours to 3-4 minutes.

majormajor

Amusingly, cursor took 5 minutes trying to figure out how to do what a simple global find/replace did for me in 30 seconds after I got tired of waiting for it's attempt just last night on a simple predictable lots-of-files change.

A 60x speedup is way more than I've seen even in its best case for things like that.

divan

In my experience, two things makes a big difference for AI agents: quality of code (naming and structure mostly) and AI-friendly documentation and tasks planning. For example, in some repos I have legacy naming that evolved after some refactoring, and while devs know that "X means Y", it's not easy for AI to figure it out unless explicitly documented. I'm still learning how to organize AI-oriented codebase documentation and planning tools (like claude task master), but they do make a big difference indeed.

throwawayscrapd

Did you ever consider refactoring the code so that you don't have to do shotgun surgery every time you make this kind of change?

osigurdson

You mean to future proof the code so requirements changes are easy to implement? Yeah, I've seen lots of code like that (some of it written by myself). Usually the envisioned future never materializes unfortunately.

divan

It's a monorepo with backend/frontend/database migrations/protobufs. Could you suggest how exactly should I refactor it so I don't need to make changes in all these parts of the codebase?

jf22

At this point why spend 5 hours refactoring when I can spend 5 minutes shot gunning the changes in?

At the same time refactoring probably takes 10 minutes with AI.

x0x0

A lot of that is inherent in the framework. eg Java and Go spew boilerplate. LLMs are actually pretty good at generating boilerplate.

See, also, testing. There's a lot of similar boilerplate for testing. Giving LLMs a list of "Test these specific items, with this specific setup, and these edge cases." I've been pretty happy writing a bulleted outline of tests and getting ... 85% complete code back? You can see a pretty stark line in a codebase I work on where I started doing this vs comprehensiveness of testing.

gyomu

So you went from being able to handle at most 10 or so of these tasks you often get per week, to >500/week. Did you reap any workplace benefits from this insane boost in productivity?

davely

My house has never been cleaner. I have time to catch up on chores that I normally do during the weekend. Dishes, laundry, walk the dog more.

It seems silly but it’s opened up a lot of extra time for some of this stuff. Heck, I even play my guitar more, something I’ve neglected for years. Noodle around while I wait for Claude to finish something and then I review it.

All in all, I dig this new world. But I also code JS web apps for a living, so just about the easiest code for an LLM to tackle.

EDIT: Though I think you are asking about work specifically. i.e., does management recognize your contributions and reward you?

For me, no. But like I said, I get more done at work and more done at home. It’s weird. And awesome.

com2kid

I used to spend time writing regex's do to this for me, now LLMs solve it in less time than it takes me to debug my one off regex!

bgwalter

Some people cannot do anything without a tool. These people are early adopters and power users, who then evangelize their latest discovery.

GitHub's value proposition was that mediocre coders can appear productive in the maze of PRs, reviews, green squares, todo lists etc.

LLMs again give mediocre coders the appearance of being productive by juggling non-essential tools and agents (which their managers also love).

danielbln

What is an essential tool? IDE? Editor? Pencil? Can I scratch my code into a French cave wall if I want to be a senior developer?

therein

I think it is very simple to draw the line at "something that tries to write for you", you know, an agent by definition. I am beginning to realize people simply would prefer to manage, even if the things they end up managing aren't actually humans. So it creates a nice live action role-play situation.

A better name for vibecoding would be larpcoding, because you are doing a live action role-play of managing a staff of engineers.

Now not only even a junior engineer can become a manager, they will start off their careers managing instead of doing. Terrifying.

osigurdson

I felt the same way until recently (like last Friday recently). While tools like Windsurf / Cursor have some utility, most of the time I am just waiting around for them while I get to read and correct the output. Essentially, I'm helping out with the training while paying to use the tool. However, now that Codex is available in ChatGPT plus, I appreciate that asynchronous flow very much. Especially for making small improvements , fixing minor bugs, etc. This has obvious value imo. What I like to do is queue up 5 - 10 tasks and the. focus on hard problems while it is working away. Then when I need a break I review / merge those PRs.

buffalobuffalo

I kinda consider it a P!=nP type thing. If I need to write a simple function, it will almost always take me more time to implement it than it will to verify if an implementation of it suits my needs. There are exceptions, but overall when coding with LLMs this seems to hold true. Asking the LLM to write the function then checking it's work is a time saver.

worldsayshi

I think this perspective is kinda key. Shifting attention towards more and better ways to verify code can probably lead to improved quality instead of degraded.

moritonal

I see it as basically Cunningham's Law. It's easier to see the LLM's attempt a solution and how it's wrong than to write a perfectly correct solution first time.

a_tartaruga

Came here to post this it is precisely right.

marvstazar

As a senior developer you already spend a significant amount of time planning new feature implementations and reviewing other people's code (PRs). I find that this skill transitions quite nicely to working with coding agents.

munificent

I don't disagree but... wouldn't you rather be working with actual people?

Spending the whole day chatting with AI agents sounds like a worst-of-both-worlds scenarios. I have to bring all of my complex, subtle soft skills into play which are difficult and tiring to use, and in the end none of that went towards actually fostering real relationships with real people.

At the end of the day, are you gonna have a beer with your agents and tell them, "Wow, we really knocked it out of the park today?"

Spending all day talking to virtual coworkers is literally the loneliest experience I can imagine, infinitely worse than actually coding in solitude the entire day.

solatic

It's a double-edged sword. AI agents don't have a long-term context window that gets better over time. People who employ AI agents today instead of juniors are going to find themselves in another local maximum: yes, the AI agent will make you more productive today compared to a junior, but (as the tech stands today) you will never be able to promote an AI agent to senior or staff, and you will not get to hire out an army of thousands of engineers that lets you deliver the sheer throughput that FAANG / Fortune 500 are capable of. You will be stuck at some shorter level of feature-delivery capacity.

cwyers

My employer can't go out and get me three actual people to work under me for $30 a month.

EDIT: You can quibble on the exact rate of people's worth of work versus the cost of these tools, but look at what a single seat on Copilot or Cursor or Windsurf gets you, and you can see that if they are only barely more productive than you working without them, the economics are it's cheaper to "hire" virtual juniors than real juniors. And the virtual juniors are getting better by the month, go look at the Aider leaderboards and compare recent models to older ones.

majormajor

You will hit two problems in this "only hire virtual juniors" thing:

* the wall of how much you can review in one day without your quality slipping now that there's far less variation in your day

* the long-term planning difficulties around future changes when you are now the only human responsible for 5-20x more code surface area

* the operational burden of keeping all that running

The tools might get good enough that you only need 5 engineers to do what used to be 10-20. But the product folks aren't gonna stop wanting you to keep churning out the changes, and the last 2 years of evolution of these models doesn't seem like it's on a trajectory to cut that down to 1 (or 0) without unforeseen breakthroughs.

aqme28

Yeah was going to make the same point.

> I still don't understand the benefit of relying on someone/something else to write your code and then reading it, understand it, fixing it, etc.

What they're saying is that they never have coworkers.

colonelspace

They're also saying that they don't understand that writing code costs businesses money.

worldsayshi

Exactly!

svaha1728

I completely agree with the author's comment that code review is half-hearted and mostly broken. With agents, the bottleneck is really in reading code, not writing it. If everyone is just half-heartedly reviewing code, or using it as a soapbox for their individual preferences, using agents will completely fall apart as they can easily introduce serious security issues or performance hits.

Let's be honest, many of those can't be found by just 'reading' the code, you have to get your hands dirty and manually debug/or test the assumptions.

rco8786

What’s not clear to me is how agents/AI written code solves the “half hearted review” problem.

People don’t like to do code reviews because it sucks. It’s tedious and boring.

I genuinely hope that we’re not giving up the fun parts of software, writing code, and in exchange getting a mountain of code to read and review instead.

thunspa

Yes, this is what I'm fearing as well.

That we will end up just trying to review code, writing tests and some kind of specifications in natural language (which is very imprecise)

However, I can't see how this approach would ever scale to a larger project.

barrenko

Yeah, honestly what's currently missing from the marketplace is a better way to read all of the code, the diffs etc. that the LLMs output, like how do you review it properly and gain an understanding of the codebase, since you're the person writing a very very small part of it.

Or even to make sure that the humans left in the project actually read the code instead of just swiping next.

Joof

Isn't that the point of agents?

Assume we have excellent test coverage -- the AI can write the code and ensure get the feedback for it being secure / fast / etc.

And the AI can help us write the damn tests!

ofjcihen

No, it can’t. Partially stems from the garbage the models were trained on.

Example anecdata but since we started having our devs heavily use agents we’ve had a resurgence of mostly dead vulnerabilities such as RCEs (CVE in 2019 for example) as well as a plethora of injection issues.

When asked how these made it in devs are responding with “I asked the LLM and it said it was secure. I even typed MAKE IT SECURE!”

If you don’t sufficiently understand something enough then you don’t know enough to call bs. In cases like this it doesn’t matter how many times the agent iterates.

klabb3

To add to this: I’ve never been gaslighted more convincingly than by an LLM, ever. The arguments they make look so convincing. They can even naturally address specific questions and counter-arguments, while being completely wrong. This is particularly bad with security and crypto, which generally isn’t verified through testing (which only proves the presence of function, not the absence).

thunspa

Saw Rich Hickey say this, that it is a known fact that tested code never has bugs.

On a more serious note: how could anyone possibly ever write meaningful tests without a deep understanding of the code that is being written?

quantumHazer

Finally some serious writing about LLMs that doesn’t follow the hype and it faces reality of what can and can’t be useful with these tools.

Really interesting read, although I can’t stand the word “agent” for a for-loop that call recursively an LLM, but this industry is not famous for being sharp with naming things, so here we are.

edit: grammar

aryehof

I agree with not liking the author’s definition of an Agent being … “a for loop which contains an LLM call”.

Instead it is an LLM calling tools/resources in a loop. The difference is subtle and a question of what is in charge.

diggan

Although implementation/internal wise it's not wrong to say it's just an llm call in a loop. If the llm responds with a tool call, you (the implementor) needs to program the call to happen, then loop back and let the llm continue.

The model/weights themselves do not execute tool calls unless the tooling around it helps them do it, and loops it.

bicepjai

I liked the phrase “tools in a loop” for agents. I think Simon said that

aryehof

He was quoting someone else. Please take care not to attribute falsely, as it creates a falsehood likely to spread and become the new (un) truth.

potatolicious

I actually take some minor issue with OP's definition of an agent. IMO an agent isn't just a LLM on a loop.

IMO the defining feature of an agent is that the LLM's behavior is being constrained or steered by some other logical component. Some of these things are deterministic while others are also ML-powered (including LLMs).

Which is to say, the LLM is being programmed in some way.

For example, prompting the LLM to build and run tests after code edits is a great way to get better performance out of it. But the idea is that you're designing a system where a deterministic layer (your tests) is nudging the LLM to do more useful things.

Likewise many "agentic reasoning" systems deliberately force the LLM to write out a plan before execution. Sometimes these plans can even be validated deterministically, and the LLM forced to re-gen if plan is no good.

The idea that the LLM is feeding itself isn't inaccurate, but misses IMO the defining way these systems are useful: they're being intentionally guided along the way by various other components that oversee the LLM's behavior.

biophysboy

Can you explain the interface between the LLM and the deterministic system? I’m not understanding how a probabilistic machine output can reliably map onto a strict input schema.

potatolicious

So it's pretty early-days for these kinds of systems, so there's no "one true" architecture that people have settled on. There are two broad variations that I see:

1 - The LLM is in charge and at the top of the stack. The deterministic bits are exposed to the LLM as tools, but you instruct the LLM specifically to use them in a particular way. For example: "Generate this code, and then run the build and tests. Do not proceed with more code generation until build and tests successfully pass. Fix any errors reported at the build and test step before continuing." This mostly works fine, but of course subject to the LLM not following instructions reliably (worse as context gets longer).

2 - A deterministic system is at the top, and uses LLMs in an otherwise-scripted program. This potentially works better when the domain the LLM is meant to solve is narrow and well-understood. In this case the structure of the system is more like a traditional program, but one that calls out to LLMs as-needed to fulfill certain tasks.

> "I’m not understanding how a probabilistic machine output can reliably map onto a strict input schema."

So there are two tricks to this:

1 - You can actually force the machine output into strict schemas. Basically all of the large model providers now support outputting in defined schemas - heck, Apple just announced their on-device LLM which can do that as well. If you want the LLM to output in a specified schema with guarantees of correctness, this is trivial to do today! This is fundamental to tool-calling.

2 - But often you don't actually want to force the LLM into strict schemas. For the coding tool example above where the LLM runs build/tests, it's often much more productive to directly expose stdout/stderr to the LLM. If the program crashed on a test, it's often very productive to just dump the stack trace as plaintext at the LLM, rather than try to coerce the data into a stronger structure and then show it to the LLM.

How much structure vs. freeform is very much domain-specific, but the important realization is that more structure isn't always good.

To make the example concrete, an example would be something like:

[LLM generates a bunch of code, in a structured format that your IDE understands and can convert into a diff]

[LLM issues the `build_and_test` tool call at your IDE. Your IDE executes the build and tests.]

[Build and tests (deterministic) complete, IDE returns the output to the LLM. This can be unstructured or structured.]

[LLM does the next thing]

vdfs

> prompting the LLM to build and run tests after code edits

Isn't that done by passing function definitions or "tools" to the llm?

beebmam

Thanks for this comment, i totally agree. Not to say this article isnt good; its great!

closewith

It seems like an excellent name, given that people understand it so readily, but what else would you suggest? LoopGPT?

layer8

RePT

quantumHazer

I’m no better at naming things! Shall we propose LLM feedback loop systems? It’s more grounded in reality. Agent is like Retina Display to my ears, at least at this stage!

minikomi

A downward spiral

closewith

Agent is clear in that it acts on behalf of the user.

"LLM feedback loop systems" could be to do with training, customer service, etc.

> Agent is like Retina Display to my ears, at least at this stage!

Retina is a great name. People know what it means - high quality screens.

solomonb

A state machine, or more specifically a Moore Machine.

gk1

> Overall, we are convinced that containers can be useful and warranted for programming.

Last week Solomon Hykes (creator of Docker) open-sourced[1] Container Use[2] exactly for this reason, to let agents run in parallel safely. Sharing it here because while Sketch seems to have isolated + local dev environments built in (cool!), no other coding agent does (afaik).

[1] https://www.youtube.com/live/U-fMsbY-kHY?si=AAswZKdyatM9QKCb... - fun to watch regardless

[2] https://github.com/dagger/container-use

asim

The agentic loop. The brain in the machine. Effectively a replacement for the rules engine. Still with a lot of quirks but crawshaw and many others from the Google era have a great way of distilling it down to its essence. It provides clarity for me as I see it over and over. Connect the agent tools, prompt it via some user request and let it go, and then repeat this process, maybe the prompt evolves over time to be a response from elsewhere, who knows. But essentially putting aside attempts to mimic human interaction and problem solving, it's going to be a useful tool for replacing orchestration or multi-step tasks that are somewhat ambiguous. That ambiguity is what we had to code before, and maybe now it'll be gone. In a production environment maybe there's a bit of a worry of executing things without a dry run but our tools, services, etc will evolve.

I am personally really interested to see what happens when you connect this in an environment of 100+ services that all look the same, behave the same and provide a consistent path to interacting with the world e.g sms, mail, weather, social, etc. When you can give it all the generic abstractions for everything we use, it can become a better assistant than what we have now or possibly even more than that.

randito

> a consistent path to interacting with the world e.g sms, mail, weather, social, etc.

Here's an interesting toy-project where someone hooked up agents to calendars, weather, etc and made a little game interface for it. https://www.geoffreylitt.com/2025/04/12/how-i-made-a-useful-...

sothatsit

> When you can give it all the generic abstractions for everything we use, it can become a better assistant than what we have now or possibly even more than that.

The range of possibilities also comes with a terrifying range of things that could go wrong...

Reliability engineering, quality assurance, permissions management, security, and privacy concerns are going to be very important in the near future.

People criticize Apple for being slow to release a better voice assistant than Siri that can do more, but I wonder how much of their trepidation comes from these concerns. Maybe they're waiting for someone else to jump on the grenade first.

dkarl

Reading code has always been as important as writing it. Now it's becoming more important. This is my nightmare. Writing code can be joy at times; reading it is always work.

a_tartaruga

Don't worry you will still get to do plenty / more of the most fun thing: fixing code.

null

[deleted]

voidUpdate

I wonder how many people that use agents actually like "programming", as in coming up with a solution to the problem and then being able to express that in code. It seems like a lot of the work that the agents are doing is removing that and instead making you have to explain what you want in natural language and hope the LLM doesn't introduce bugs

hombre_fatal

I like writing code, and it definitely isn't satisfying when an LLM can one-shot a parser that I would have had fun building for hours.

But at the same time, building a parser for hours is also a distraction from my higher level ambitions with the project, and I get to focus on those.

I still get to stub out the types and function signatures I want, but the LLM can fill them in and I move on. More likely I'll even have my go at the implementation but then tag in the LLM when it's not fun anymore.

On the other hand, LLMs have helped me focus on the fun of polishing something. Making sweeping changes are no longer in the realm of "it'd be nice but I can't be bothered". Generating a bunch of tests from examples isn't grueling anymore. Syncing code to the readme isn't annoying anymore. Coming up with refactoring/improvement ideas is easy; just ask and tell it to make the case for you. It has let me be far more ambitious or take a weekend project to a whole new level, and that's fun.

It's actually a software-loving builder's paradise if you can tweak your mindset. You can polish more code, release more projects, tackle more nerdsnipes, and aim much higher. But it took me a while to get over what turned out to be some sort of resentment.

bubblyworld

I agree, agents have really made programming fun for me again (and I say this as someone who has been coding for more two decades - I'm not a script kiddy using them to make up for lack of skill).

Configuring tools, mindless refactors, boilerplate, basic unit/property testing, all that routine stuff is a thing of the past for me now. It used to be a serious blocker for me with my personal projects! Getting bored before I got anywhere interesting. Much of the time I can stick to writing the fun/critical code now and glue everything else together with LLMs, which is awesome.

Some people obviously like the fiddly stuff though, and more power to them, it's just not for me.

Verdex

Parsing is an area that I'm interested in. Can you talk more about your experience getting LLMs to one-shot parsers?

From scratch LLMs seem to be completely lost writing parsers. The bleeding edge appears to be able to maybe parse xml, but gives up on programming languages with even the most minimal complexity (an example being C where Gemini refused to even try with macros and then when told to parse C without macros gave an answer with several stubs where I was supposed to fill in the details).

With parsing libraries they seem better, but ultimately that reduces to transform this bnf. Which if I had to I could do deterministically without an LLM.

Also, my best 'successes' have been along the lines of 'parse in this well defined language that just happens to have dozens if not hundreds of verbatim examples on github'. Anytime I try to give examples of a hypothetical language then they return a bunch of regex that would not work in general.

wrs

A few weeks ago I gave an LLM (Gemini 2.5 something in Cursor) a bunch of examples of a new language, and asked it to write a recursive descent parser in Ruby. The language was nothing crazy, intentionally reminiscent of C/JS style, but certainly the exact definition was new. I didn’t want to use a parser generator because (a) I’d have to learn a new one for Ruby, and (b) I’ve always found it easier to generate useful error messages with a handwritten recursive descent parser.

IIRC, it went like this: I had it first write out the BNF based on the examples, and tweaked that a bit to match my intention. Then I had it write the lexer, and a bunch of tests for the lexer. I had it rewrite the lexer to use one big regex with named captures per token. Then I told it to write the parser. I told it to try again using a consistent style in the parser functions (when to do lookahead and how to do backtracking) and it rewrote it. I told it to write a bunch of parser tests, which I tweaked and refactored for readability (with LLM doing the grunt work). During this process it fixed most of its own bugs based on looking at failed tests.

Throughout this process I had to monitor every step and fix the occasional stupidity and wrong turn, but it felt like using a power tool, you just have to keep it aimed the right way so it does what you want.

The end result worked just fine, the code is quite readable and maintainable, and I’ve continued with that codebase since. That was a day of work that would have taken me more like a week without the LLM. And there is no parser generator I’m aware of that starts with examples rather than a grammar.

timeinput

> I still get to stub out the types and function signatures I want, but the LLM can fill them in and I move on. More likely I'll even have my go at the implementation but then tag in the LLM when it's not fun anymore.

This is the best part for me. I can design my program the way I want. Then hack at the implementation, get it close, and then say okay finish it up (fix the current compiler errors, write and run some unit tests etc).

Then when it's time to write some boiler plate / do some boiler plate refactoring it's extract function xxx into a trait. Write a struct that does xxx and implements that trait.

I'm not over the resentment entirely, and if someone were to push me to join a team that coded by creating github issues, and reviewing the PRs I would probably hate that job, I certainly do when I try to do that in my free time.

In wood working you can use hand tools or power tools. I use hand tools when I want to use them either for a particular effect, or just the joy of using them, and I don't resent having to use a circular saw, or orbital sander when that's the tool I want to use, or the job calls for it. To stretch the analogy developing with plain text prompts and reviewing PRs feels more like assembling Ikea furniture. Frustrating and dull. A machine did most of the work cutting out the parts, and now I need to figure out what they want me to do with them.

sanderjd

This is exactly my take as well!

I do really like programming qua programming, and I relate to a lot of the lamentation I see from people in these threads at the devaluation of this skill.

But there are lots of other things that I also enjoy doing, and these tools are opening up so many opportunities now. I have had tons of ideas for things I want to learn how to do or that I want to build that I have abandoned because I concluded they would require too much time. Not all, but many, of those things are now way easier to do. Tons of things are now under the activation energy to make them worthwhile, which were previously well beyond it.

Just as a very narrow example, I've been taking on a lot more large scale refactorings to make little improvements that I've always wanted to make, but which have not previously been worth the effort, but now are.

qsort

I have to flip the question, what is it that people like about it? I certainly don't enjoy writing code for problems that have already been solved a thousand times. We reach for a dictionary, we don't write a hash table from scratch every time, that's only fun the first time you do it.

If I could go "give me a working compiler for this language" or "solve this problem using a depth-first search" I wouldn't enjoy programming any less.

About the natural language and also in response to the sibling comment, I agree, natural language is a very poor tool to describe computational processes. It's like doing math in plain English, fine for toy examples, but at a certain level of sophistication it's way too easy to say imprecise or even completely contradictory things. But nobody here advocates using LLMs "blind"! You're still responsible for your own output, whether it was generated or not.

voidUpdate

Why do people enjoy going to the gym? Those weights have already been lifted a thousand times.

I enjoy writing code because of the satisfaction that comes from solving a problem, from being able to create a working thing out of my own head, and to hopefully see myself getting better at programming. I could augment my programming abilities with an LLM in the same way you could augment your gym experience with a forklift. I like to do it because I'm doing it. If I could go "give me a working compiler for this language", I wouldn't enjoy it anymore, because I've not gained anything from it. Obviously I don't re-implement a dictionary every time I need one, because its part of the "standard library" of basically everything I code in. And if it isn't, part of the fun is the challenge of either working out another way to do it, or reimplementing it.

qsort

We are talking past each other here.

Once I solved an Advent of Code problem, I felt like the problem wasn't general enough, so I solved the more general version as well. I like programming to the point of doing imaginary homework, then writing myself some extra credit and doing that as well. Way too much for my own good.

The point is that solving a new problem is interesting. Solving a problem you already know exactly how to solve isn't interesting and isn't even intellectual exercise. I would gain approximately zero from writing a new hash table from scratch whenever I needed one instead of just using std::map.

Problem solving absolutely is a muscle and it's use it or lose it, but you don't train problem solving by solving the same problem over and over.

sanderjd

I think this is a good analogy! But I draw a different conclusion from it.

You're right that you wouldn't want to use a forklift to lift the weights at a gym. But then why do forklifts exist? Well, because gyms aren't the only place where people lift heavy things. People also lift and move around big pallets of heavy stuff at their jobs. And even if those people are gym rats, they don't forgo the forklift when they're at work, because it's more efficient, and exercising isn't the goal, at work.

In much the same way, it's would be silly to have an LLM write the solutions while working through the exercises in a book or advent of code or whatever. Those are exercises that are akin to going to the gym.

But it would also be silly to refuse to use the available tools to more efficiently solve problems at work. That would be like refusing to use a forklift.

infecto

Different strokes for different folks. I have written crud apps and other simple implementations thousands of times it feels like. My satisfaction is derived from building something useful not just the sale of building.

falcor84

> Why do people enjoy going to the gym?

Do they? I would assume that the overwhelming majority of people would be very happy to be able to get 50% of the results for twice the membership cost if they could avoid going.

BeetleB

OK. Be honest. If you had to write an argument parser once a week, would you enjoy it?

Or extracting input from a config file?

Or setting up a logger?

quantumHazer

Exactly. Also related on why Natural Language is not really good for programming[0]

[0]: https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

Anyway I indeed find LLMs useful for stackoverflow-like programming questions. But this seems to not be true for long as SO is dying and updated data on this type of questions will shrink I think.

infecto

Don’t agree with the assessment. At this point most of what I find LLM taking over is all the repetitive crud like implementations. I am still doing what I consider the fun parts, architecting the project and solving what are still the hard parts for the LLM, the non crud parts. This could be gone in a year and maybe I become a glorified product manager but enjoying it for the time being l, I can focus on the real thought problems and get help lifting the crud or repetitive patterns.

voidUpdate

If you keep asking an LLM to generate the same repetitive implementations, why not just have a basic project already set up that you can modify as needed?

bluefirebrand

Yeah, I don't really get this

Most boilerplate I write has a template that I can copy and paste then run a couple of "find and replace" on and get going right away

This is not a substantial blocker or time investment that an AI can save me imo

infecto

The LLM is doing the modifications and specific nuance that I want. Saves me time, ymmv.

sanderjd

Because they are similar and repetitive, but not identical.

crawshaw

Author here. I like programming and I like agents.

null

[deleted]

verifex

Some of my favorite things to use AI for when coding (I swear I wrote this not AI!):

- CSS: I don't like working with CSS on any website ever, and all of the kludges added on-top of it don't make it any more fun. AI makes it a little fun since it can remember all the CSS hacks so I don't have to spend an hour figuring out how to center some element on the page. Even if it doesn't get it right the first time, it still takes less time than me struggling with it to center some div in a complex Wordpress or other nightmare site.

- Unit Tests: Assuming the embedded code in the AI isn't too outdated (caveat: sometimes it is, and that invalidates this one sometimes). Farming out unit tests to AI is a fun little exercise.

- Summarizing a commit: It's not bad at summarizing, at least an initial draft.

- Very small first-year-software-engineering-exercise-type tasks.

topek

Interesting, I found AIs annoyingly incapable of writing good CSS. But I understand the appeal of using it for a task that you do not like to do yourself. For me it's writing ticket descriptions which it does way better than me.

Aachen

Can you give an example?

Descriptions for things was the #1 example for me where LLMs are a hindrance, so I'm surprised to hear this. If the LLM (not working at this company / having a limited context window) gets your meaning from bullet points or keywords and writes nice prose, I could just read that shorthand (your input aka prompt) and not have to bother with the wordiness. But apparently you've managed to find a use for it?

mvdtnz

I'm not trying to be presumptuous about the state of your CSS knowledge so tell me to get lost if I'm off base. But if you haven't updated yourself on where CSS is at these days I'd recommend spending an afternoon doing a deep dive. Modern-day CSS is way less kludgy and hacky than it used to be. It's not so hard now to manage large CSS codebases and centering elements is relatively simple now.

Having said that I still lean heavily on AI to do my styling too these days.

atrettel

The "assets" and "debt" discussion near the middle is interesting, but I can't say that I agree.

Yes, many programs are not used my many users, but many programs that have a lot of users now and have existed for a long time started with a small audience and were only intended to be used for a short time. I cannot tell you how many times I have encountered scientific code that was haphazardly written for one purpose years ago that has expanded well beyond its scope and well beyond its initial intended lifetime. Based on those experiences, I write my code well aware that it may be used for longer than I anticipated and in a broader scope than I anticipated. I do this as both a courtesy for myself and for others. If you have had to work on a codebase that started out as somebody's personal project and then got elevated by a manager to a group project, you would understand.

spenczar5

The issue is, whats the alternative? People are generally bad at predicting what work will get broad adoption. Carefully elegantly constructing a project that goes nowhere also seems to be a common failure mode; there is a sort of evolutionary pressure towards sloppy projects succeeding because they are cheaper to produce.

This reminds me of classics like "worse is better," for today's age (https://www.dreamsongs.com/RiseOfWorseIsBetter.html)

atrettel

You're right that there isn't a good alternative. I'll just describe that I try to do even if it is inadequate. I write the code as obviously as possible without taking more time (as a courtesy to myself), and I then document the scope of what I am writing when I write the code (what I intend for it to do and intend for it to not do). The documentation is a CYA measure. That way, if something does get elevated, well, I've described its limitations upfront.

And to be frank, in scientific circles, having documentation at all is a good smell test. I've seen so many projects that contain absolutely no documentation, so it is really easy to forget about the capabilities and limitations of a piece of software. It's all just taught through experience and conversations with other people. I'd rather have something in writing so that nobody, especially managers, misinterprets what a piece of software was designed to do or be good at. Even a short README saying this person wrote this piece of software to do this one task and only this one task is excellent.

bArray

LLMs for code review, rather than code writing/design could be the killer feature. I think that code review has been broken for a while now, but this could be a way forward. Of particular interest would be security, undefined behaviour, basic misuse of features, double checking warnings out of the compiler against the source code to ensure it isn't something more serious, etc.

My current use of LLMs is typically via the search engine when trying to get information about an error. It has maybe a 50% hit rate, which is okay because I'm typically asking about an edge case.

rectang

ChatGPT is great for debugging common issues that have been written about extensively on the web (before the training cutoff). It's a synthesizer of Stack Overflow and greatly cuts down on the time it takes to figure out what's going on compared with searching for discussions and reading them individually.

(This IP rightly belongs to the Stack Overflow contributors and is licensed to Stack Overflow. It ought to be those parties who are exploiting it. I have mixed feelings about participating as a user.)

However, the LLM output is also noisy because of hallucinations — just less noisy than web searching.

I imagine that an LLM could assess a codebase and find common mistakes, problematic function/API invocations, etc. However, there would also be a lot of false positives. Are people using LLMs that way?

flir

If you do "please review this code" in a loop, you'll eventually find a case where the chatbot starts by changing X to Y, and a bit later changes Y back to X.

It works for code review, but you have to be judicious about which changes you accept and which you reject. If you know enough to know an improvement when you see one, it's pretty great at spitting out candidate changes which you can then accept or reject.

monkeydust

Why isn't this spoken more about? Not a developer but work very closely with many - they are all on a spectrum from zero interest in this technology to actively using it to write code (correlates inversely seniority from my sample set) - very little talk on using it for reviews/checks - perhaps that needs to be done passively on commit.

bkolobara

The main issue with LLMs is that they can't "judge" contributions correctly. Their review is very nitpicky on things that don't matter and often misses big issues that a human familiar with the codebase would recognise. It's almost just noise at the end.

That's why everyone is moving to the agent thing. Even if the LLM makes a bunch of mistakes, you still have a human doing the decision making and get some determinism.

null

[deleted]

fwip

So far, it seems pretty bad at code review. You'd get more mileage by configuring a linter.

asabla

> LLMs for code review, rather than code writing/design could be the killer feature

This is already available on GitHub using Copilot as a reviewer. It's not the best suggestions, but usable enough to continue having in the loop.

brendanator

Totally agree - we’re working on this at https://sourcery.ai

galaxyLogic

I think what AI "should" be good at is writing code that passes unit-tests written by me the Human.

AI cannot know what we want it to write - unless we tell it exactly what we want by writing some unit-tests and tell it we want code that passes them.

But is any LLM able to do that?

warmwaffles

You can write the tests first and tell the AI to do the implementation and give it some guidance. I usually go the other direction though, I tell the LLM to stub the tests out and let me fill in the details.

afro88

Great post, and sums up my recent experience with Cursor. There has been a jump in effectiveness that only happened recently, that is articulated well very late in the post:

> The answer is a critical chunk of the work for making agents useful is in the training process of the underlying models. The LLMs of 2023 could not drive agents, the LLMs of 2025 are optimized for it. Models have to robustly call the tools they are given and make good use of them. We are only now starting to see frontier models that are good at this. And while our goal is to eventually work entirely with open models, the open models are trailing the frontier models in our tool calling evals. We are confident the story will change in six months, but for now, useful repeated tool calling is a new feature for the underlying models.

So yes, a software engineering agent is a simple for-loop. But it can only be a simple for-loop because the models have been trained really well for tool use.

In my experience Gemini Pro 2.5 was the first to show promise here. Claude Sonnet / Opus 4 are both a jump up in quality here though. Very rare that tool use fails, and even rarer that it can't resolve the issue on the next loop.

HN

How I Program with Agents

How I Program with Agents