AI can code, but it can't build software

simonw

This is a good headline. LLMs are remarkably good at writing code. Writing code isn't the same thing as delivering working software.

A human expert needs to identify the need for software, decide what the software should do, figure out what's feasible to deliver, build the first version (AI can help a bunch here), evaluate what they've built, show it to users, talk to them about whether it's fit for purpose, iterate based on their feedback, deploy and communicate the value of the software, and manage its existence and continued evolution in the future.

Some of that stuff can be handled by non-developer humans working with LLMs, but a human expert who understands code will be able to do this stuff a whole lot more effectively.

I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers, or if programmers can pick up enough PM skills to work without PMs.

My money is on both roles continuing to exist and benefit from each other, in a partnership that produces results a lot faster because the previously slow "writing the code" part is a lot faster than it used to be.

samsolomon

I think you're right, the roles will exist for some time. But I think we'll start to see more and more overlap between engineering, product management and design.

In a lot of ways I think that will lead to stronger delivery teams. As a designer—the best performing teams I've been on have individuals with a core competency, but a lot of overlap in other areas. Product managers with strong engineering instincts, engineers with strong design instincts, etc. When there is less ambiguity in communication, teams deliver better software.

Longer-term I'm unsure. Maybe there is some sort of fusion into all-purpose product people able to do everything?

roxolotl

One of the interesting corollaries of the title is that this can also be true of humans. Being able to code is not the same as being a software engineer. It never has been.

bloppe

At least you can teach a human to become a software engineer.

prmph

> LLMs are remarkably good at writing code.

Just this past weekend, I've designed and written code (in Typescript) that I don't think LLMs can even come close to writing in years. I have a subscription to a frontier LLM, but lately I find myself using it like 25% of the time.

At a certain level the software architecture problems I'm solving, drawing upon decades of understanding about maintainable, performant, and verifiable design of data structures and types and algorithms, are things LLMs cannot even begin to grasp.

At that point, I find that attempting to use an LLM to even draft an initial solution is a waste of time. At best I can use it for initial brainstorming.

The people saying LLM can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it.

And even for these tasks the amount of hand-holding I need to do is substantial. At least Gemini Pro/CLI seems good at one-shot performance, before its context gets poisoned.

CjHuber

Can you maybe give an example you’ve encountered of an algorithm or a data structure that LLMs cannot handle well?

In my experience implementing algorithms from a good comprehensive description and keeping track of data models is where they shine the most.

crazygringo

> The people saying LLM can code are hard for me to understand.

Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.

I then spent 15 minutes explaining to the free version of ChatGPT what the function needs to do both in scientific terms and in computer architecture terms (e.g. what needed to be separated out for unit tests). Then it asked me to answer ~15 questions it had (most were yes/no, it took about 5 min), then it output around 700 lines of code.

It took me about 5 minutes to get it working, since it had a few typos. It ran.

Then I spent another 15 minutes laying out all the categories of unit tests and sanity tests I wanted it to write. It produced ~1500 lines of tests. It took me half an hour to read through them all, adjusting some edge cases that didn't make sense to me and adjusting the code accordingly. And a couple cases where it was testing the right part of the code, but had made valiant but wrong guesses as to what the scientifically correct answer would be. All the tests then passed.

All in all, a little over two hours. And it ran perfectly. In contrast, writing the code and tests myself entirely by hand would have taken at least a couple of entire days.

So when you say they're good for those simple things you list and "that's about it", I couldn't disagree more. In fact, I find myself relying on them more and more for the hardest scientific and algorithmic programming, when I provide the design and the code is relatively self-contained and tests can ensure correctness. I do the thinking, it does the coding.

prmph

> documenting a function that performs a set of complex scientific simulations.

The example you gave sounds like the problem is deterministic, even if composed of many moving parts. That's one way of looking at complexity.

When I talk about complex problems I'm not just talking about intricate problems. I'm talking about problems where the "problem" is design, not just implementing a design, and that is where LLMs struggle a lot.

For example, I want to design a strongly typed fluent API interface to some functionality. Even knowing how to shape this fluent interface so that it is powerful, intuitive, well/strongly typed, and maintainable is a deep art.

The intuitive design constraints that I'm designing under would be hard to even explain to an LLM.
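
To make the kind of problem concrete, here is a minimal sketch of a strongly typed fluent interface in TypeScript. It is not the API being described above: the query-builder domain and every name in it (`query`, `NeedsTable`, `Ready`, `from`, `where`, `take`, `build`) are hypothetical, picked only for illustration. It shows one small constraint such designs juggle, namely making `build()` unavailable at compile time until a required step has been called.

```typescript
// Hypothetical example: a tiny immutable query builder.
type Query = { table: string; filters: string[]; limit?: number };

// Stage 1: no table chosen yet, so build() is not part of the type.
interface NeedsTable {
  from(table: string): Ready;
  where(condition: string): NeedsTable;
}

// Stage 2: a table is set, so build() becomes available.
interface Ready {
  where(condition: string): Ready;
  take(limit: number): Ready;
  build(): Query;
}

// Entry point of the fluent chain.
function query(): NeedsTable {
  return needsTable([]);
}

// Each call returns a fresh object, so partial chains can be reused safely.
function needsTable(filters: string[]): NeedsTable {
  return {
    from: (table) => ready({ table, filters }),
    where: (condition) => needsTable([...filters, condition]),
  };
}

function ready(q: Query): Ready {
  return {
    where: (condition) => ready({ ...q, filters: [...q.filters, condition] }),
    take: (limit) => ready({ ...q, limit }),
    build: () => ({ ...q, filters: [...q.filters] }),
  };
}

// Filters may come before or after from(); build() only exists after it.
const adults = query().where("age > 30").from("users").take(10).build();

// query().where("age > 30").build();
// ^ rejected by the type checker: NeedsTable has no build() method.
```

Even in a toy like this, the real design questions (which orderings to allow, how many stages the types should encode, how readable the resulting error messages are) are judgment calls rather than implementation details.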

DougWebb

> Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.

So that's... math. A very well defined problem, defined very well. Any decent programmer should be able to produce working software from that, and it's great that ChatGPT was able to help you get it done much faster than you could have done it yourself. That's also the kind of project that's very well suited for unit testing, because again: math. Functions with well defined inputs, outputs, and no side-effects.

Only a tiny subset of software development projects are like that though.
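
As a toy illustration of why that category is so test-friendly (this is an example chosen here, not anything from the simulation code discussed above), a pure numeric function needs no mocks or setup. A test is just known inputs against known answers, within a tolerance. The names `trapezoid` and `assertClose` are made up for the sketch.

```typescript
// Trapezoidal rule: approximates the integral of f over [a, b] using n slices.
// Pure function: the result depends only on its arguments.
function trapezoid(f: (x: number) => number, a: number, b: number, n: number): number {
  if (n < 1 || !Number.isInteger(n)) {
    throw new RangeError("n must be a positive integer");
  }
  const h = (b - a) / n;
  let sum = (f(a) + f(b)) / 2;
  for (let i = 1; i < n; i++) {
    sum += f(a + i * h);
  }
  return sum * h;
}

// Minimal test helper: compares floats within an absolute tolerance.
function assertClose(actual: number, expected: number, tol: number): void {
  if (Math.abs(actual - expected) > tol) {
    throw new Error(`expected ${expected} within ${tol}, got ${actual}`);
  }
}

// The rule is exact for linear functions.
assertClose(trapezoid((x) => 2 * x, 0, 1, 4), 1, 1e-12);
// For x^2 on [0, 1] the true value is 1/3; the error shrinks as n grows.
assertClose(trapezoid((x) => x * x, 0, 1, 1000), 1 / 3, 1e-6);
```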

airstrike

I find LLMs most helpful when I already have half of the answer written and need them to fill in the blanks.

"Take X and Y I've written before, some documentation for Z, an example W from that repo, now smash them together and build the thing I need"

veegee

Lmao I was about to agree with you and then you said “typescript”. It’s always cute when script kiddies call themselves engineers or use words like “architecture”.

avgDev

This is an unhinged comment. You should take a deep breath and get off the internet. You sound extremely immature calling someone on HN "script kiddie".

wutwutwat

What do you plan to do after your software career is over?

Bukhmanizer

> the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I have a strong opinion that AI will boost the importance of people with “special knowledge” more than anyone else regardless of role. So engineers with deep knowledge of a system or PMs with deep knowledge of a domain.

jfim

> I guess the big question is if experienced product management types can pick up enough coding technical literacy to work like this without programmers

I'd argue that they can't, at least on a short timeframe. Not because LLMs can't generate a program or product that works, but because there needs to be enough understanding of how the implementation works to fix any complex issues that come up.

In one case, I tried using Claude to generate a MITM HTTPS proxy built on Netty, and while it generated a pile of code that looked good on the surface, it didn't actually work. Not knowing enough about Netty, I wasn't able to debug why it didn't work, and trying to fix it with the LLM didn't help either.

Maybe PMs can pick up enough knowledge over time to be able to implement products that can scale, but by that time they'd effectively be a software engineer, minus the writing code part.

ambicapter

LLMs are great for learning though, you can easily ask them questions, and you can evaluate your understanding every step of the way, and gradually build the accuracy of your world model that way. It’s not uncommon for me to ask a general question, drill deeper into a concept, and then either test things manually with some toy code or end up reading the official documentation, this time with at least some exposure to the words that I’m looking for to answer my question.

sodaclean

This is how I use them, but I also use them to write initial UIs (usually very primitive). Because I've got an issue where the UI has to be perfect, and if I can blame somebody/something other than me I can ignore it until the UI becomes important enough.

o11c

If I wanted a confident and simple answer with no regard for veracity, I would just ask a politician.

pron

> LLMs are remarkably good at writing code.

Remarkably good compared to what? Their ability is undoubtedly incredible compared to what we've come to expect computers to do but, while that ability is truly spectacular given our past expectations, I'm not sure it's good enough for what we need, i.e. consistently, reliably, and substantially improving programming productivity (I mean, they sometimes substantially improve productivity, but not consistently and reliably).

IanCal

I disagree. Unless you’re focussed on right now, in which case… maybe? Depends on scale.

I have a few scattered thoughts here but I think you’re caught up on how things are done now.

A human expert in a field is the customer.

Do you think, say, gpt5 pro can’t talk to them about a problem and what’s reasonable to try and build in software?

It can build a thing, with tests, run stuff and return to a user.

It can take feedback (talking to people is the key thing LLMs have solved).

They can iterate (see: Codex), deploy, and they can absolutely write copy.

What do you really think in this list they can’t do?

For simplicity reduce it to a relatively basic crud app. We know that they can make these over several steps. We know they can manage the ui pretty well, do incremental work etc. What’s missing?

I think something huge here is that some of the software engineering roles and management become exceptionally fast and cheap. That means you don’t need to have as many users to be worthwhile writing code to solve a problem. Entirely personal software becomes economically viable. I don’t need to communicate value for the problem my app has solved because it’s solved it for me.

Frankly most of the “AI can’t ever do my thing” comments come across as the same as “nobody can estimate my tasks they’re so unique” we see every time something comes up about planning. Most business relevant SE isn’t complex logically, interestingly unique or frankly hard. It’s just a different language to speak.

Disclaimer: a client of mine is working on making software simpler to build and I’m looking at the AI side, but I have these views regardless.

colordrops

Once all the context that a typical human engineer has to "build software" is available to the LLM, I'm not so sure that this statement will hold true.

bloppe

But it's becoming increasingly clear that LLMs based on the transformer model will never be able to scale their context much further than the current frontier, due mainly to context rot. Taking advantage of greater context will require architectural breakthroughs.

Calamityjanitor

I feel you can apply this to all roles. When models passed high school exam benchmarks, some people talked as if that made the model equivalent to a person passing high school. I may be wrong, but I bet even a state-of-the-art LLM couldn't complete high school. You have to do things like attending classes at the right time/place, take initiative, keep track of different classes. All of the bigger picture thinking and soft skills that aren't in a pure exam.

Improving this is what everyone's looking into now. Even larger models, context windows, adding reasoning, or something else might improve this one day.

jumploops

I've been forcing myself to "pure vibe-code" on a few projects, where I don't read a single line of code (even the diffs in codex/claude code).

Candidly, it's awful. There are countless situations where it would be faster for me to edit the file directly (CSS, I'm looking at you!).

With that said, I've been surprised at how far the coding agents are able to go[0], and a lot less surprised about where I need to step in.

Things that seem to help:

1. Always create a plan/debug markdown file

2. Prompt the agent to ask questions/present multiple solutions

3. Use git more than normal (squash ugly commits on merge)

Planning is key to avoid half-brained solutions, but having "specs" for debug is almost more important. The LLM will happily dive down a path of editing as few files as possible to fix the bug/error/etc. This, unchecked, can often lead to very messy code.

Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over how something is built.

I now basically commit every time a plan or debug step is complete. I've tried having the LLM control git, but I feel that it eats into the context a bit too much. Ideally a 3rd party "agent" would handle this.

The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault. For both cases, this is where planning up-front is super useful.

[0] Caveat: the projects are either Typescript web apps or Rust utilities, can't speak to performance on other languages/domains.

svachalek

Also, put heavy lint rules in place, and commit hooks to make sure everything compiles, lints, passes tests, etc. You've got to be super, super defensive. But Claude Code will see all those barriers and respond to them automatically which saves you the trouble of being vigilant over so many little things. You just need to watch the big picture, like make sure tests are there to replicate bugs, new features are tested, etc, etc.
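
A rough sketch of what such a gate could look like as a Node script in TypeScript. The three commands (`tsc --noEmit`, `eslint .`, `npm test`) are assumptions about a typical TypeScript project rather than anyone's actual setup, and the names `Check`, `Runner`, `firstFailure`, and `main` are invented for the example.

```typescript
import { spawnSync } from "node:child_process";

type Check = { name: string; command: string; args: string[] };
type Runner = (check: Check) => number; // returns an exit code

// Assumed checks for a typical TypeScript project; swap in your own.
const checks: Check[] = [
  { name: "compile", command: "npx", args: ["tsc", "--noEmit"] },
  { name: "lint", command: "npx", args: ["eslint", "."] },
  { name: "test", command: "npm", args: ["test"] },
];

// Runs a check as a child process, streaming its output to the terminal.
// A null status (e.g. the process was killed) is treated as a failure.
const runInShell: Runner = (check) =>
  spawnSync(check.command, check.args, { stdio: "inherit" }).status ?? 1;

// Runs checks in order and stops at the first failure.
// Returns the name of the failing check, or null if all passed.
function firstFailure(toRun: Check[], run: Runner): string | null {
  for (const check of toRun) {
    if (run(check) !== 0) return check.name;
  }
  return null;
}

// Entry point for a hook script: exits non-zero when any check fails.
function main(): void {
  const failed = firstFailure(checks, runInShell);
  if (failed !== null) {
    console.error(`pre-commit: "${failed}" failed, commit aborted`);
    process.exit(1);
  }
}
```

Calling `main()` from whatever script `.git/hooks/pre-commit` runs is enough, since git aborts the commit when that hook exits non-zero. Passing the runner in as a parameter keeps `firstFailure` testable without spawning real processes.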

throwaway314155

> Candidly, it's awful.

Noting your caveat but I’m doing this with Python and your experience is very different from mine.

tharkun__

    Always create a plan/debug markdown file
Very much necessary. Especially with Claude I find. It auto-compacts so often (Sonnet 4.5) and it instantly goes a-wall stupid after that. I then make it re-read the markdown file, so we can actually continue without it forgetting about 90% of what we just did/talked about.

    Prompt the agent to ask questions/present multiple solutions
I find that only helps marginally. They all output so much text it's not even funny. And that's with one "solution".

I don't get how people can stand reading all that nonsense they spew, especially Claude. Everything is insta-ready to deploy, problem solved, root cause found, go hit the big red button that might destroy the earth in a mushroom cloud. I learned real fast to only skim what it says and ignore all that crap (as in, I never tried to "change its personality" for real). I did try to tell it to always use the scientific method and prove its assumptions, but just like a junior dev it never does and just tells me stupid things it believes to be true, and I have to question it. Again, just like a junior dev, but it's my junior dev that's always on and available when I have time, and it does things while I do other stuff. And instead of me having to ask the junior after an hour or two what rabbit hole they went down and get them out of there, Claude and Codex usually visually ping the terminal before I even have time to notice. That's for when I don't have full-time focus on what I'm trying to do with the agents, which is why I do like using them.

The times when I am fully attentive, they're just soooo slow. And many many times I could do what they're doing faster or just as fast but without spending extra money and "environment". I've been trying to "only use AI agents for coding" for like a month or two now to see its positives and limitations and form my own opinion(s).

    Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over how something is built.
I find Claude's "Plan mode" is actually ideal. I just enable it and I don't have to tell it anything. While Codex "breaks out" from time to time and just starts coding even when I just ask it a question. If these machines ever take over, there's probably some record of me swearing at them and I will get a hitman on me. Unlike junior devs, I have no qualms about telling a model that it again ignored everything I told it.

    Ideally a 3rd party "agent" would handle this.
With sub-agents you can. Simple git interactions are perfect for sub-agents because not much can get lost in translation in the interface between the main agent and the sub-agent. Then again, I'm not sure how you lose that much context. I'd rather use a sub-agent for things like running the tests and linter on the whole project in the final steps, which spew a lot of unnecessary output.

Personally, I had a rather bad set of experiences with it controlling git without oversight, so I do that myself, since doing it myself is less taxing than approving everything it wants to do (I automatically allow Claude certain commands that are read only for investigations and reviewing things).

pron

> I don’t really know why AI can't build software (for now)

Could be because programming involves:

1. Long chains of logical reasoning, and

2. Applying abstract principles in practice (in this case, "best practices" of software engineering).

I think LLMs are currently bad at both of these things. They may well be among the things LLMs are worst at atm.

Also, there should be a big asterisk next to "can write code". LLMs do often produce correct code of some size and of certain kinds, but they can also fail at that too frequently.

orliesaurus

Software engineering has always been about managing complexity, not writing code. Code is just the artifact. No-code and low-code are still code, but they don't make for a well-engineered application.

eterm

I've been experimenting with a little vibe coding.

I've generally found the quality of .NET to be quite good. It trips up sometimes when linters ping it for rules not normally enforced, but it does the job reasonably well.

The front-end JavaScript though? It's both an absolute genius and a complete menace at the same time. It'll write reams of code to get things just right but with no regard for human maintainability.

I lost an entire session to the fact that it cheerfully did:

    npm install fabric
    npm install -D @types/fabric
Now that might look fine, but a human would have realised that the typings package describes a completely different, outdated API and was last updated 6 years ago.

Claude however didn't realise this, and wrote a ton of code that would pass unit tests but fail the type check. It'd run the type checker, rewrite it all to pass the type checker, only for it now to fail the unit tests.

Eventually it semi-gave up on typing and did loads of (fabric as any) all over the place, so now it just gave runtime exceptions instead.

I intervened when I realised what it was doing, and found the root cause of its problems.

It was a complete blindspot because it just trusted both the library and the typechecker.

So yeah, if you want to snipe a vibe coder, suggest installing fabricjs with typings!
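
For anyone who hasn't been bitten by this, here is a minimal sketch of why the `as any` escape hatch only moves the failure from the type checker to runtime. The `Canvas` interface and the `renderAll` call are made up for the example and are not fabric's real API.

```typescript
// Hypothetical stand-in for a library object.
interface Canvas {
  render(): string;
}

const canvas: Canvas = { render: () => "rendered" };

// With types in play, calling a method that is not declared is a compile error:
// canvas.renderAll();
// ^ error TS2339: Property 'renderAll' does not exist on type 'Canvas'.

// Casting to `any` silences the checker, so the mistake survives until runtime.
function callMissingMethod(target: Canvas): string {
  try {
    return (target as any).renderAll();
  } catch (err) {
    return err instanceof TypeError ? "TypeError at runtime" : "other error";
  }
}

console.log(callMissingMethod(canvas)); // prints "TypeError at runtime"
```

When the type definitions and the installed library disagree, the cast hides the disagreement instead of resolving it, which matches the loop described above.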

teaearlgraycold

Although - at least for simple packages - I've found LLMs good at extracting type definitions from untyped libraries.

subtlesoftware

True for now because models are mainly used to implement features / build small MVPs, which they’re quite good at.

The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.

We’re not there today, but it doesn’t seem that far off.

bcrosby95

> The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.

An automated system that determines whether a system is correct (whatever that means) is harder to build than the coding agents themselves.

bloppe

> We’re not there today, but it doesn’t seem that far off.

What time frame counts as "not that far off" to you?

If you tried to bet me that the market for talented software engineers would collapse within the next 10 years, I'd take it no question. 25 years, I think my odds are still better than yours. 50 years, I might not take the bet.

subtlesoftware

Great question. It depends on the product. For niche SaaS products, I’d say in the next few years. For like Amazon.com, on the order of decades.

bloppe

If the niche SaaS product never required a talented engineer in the first place, I'd be inclined to agree with you. But even a niche SaaS product requires a decent amount of engineering skill to maintain well.

thomasfromcdnjs

Agreed.

I've played around with agent-only code bases (where I don't code at all), and had an agent hooked up to server logs, which would create an issue when it encountered errors, and then an agent would fix the tickets, push to prod and check deployment statuses etc. Worked well enough to see that this could easily become the future. (I also had Claude/Codex code that whole setup.)

Just for semantic nitpicking, I've zero-shot heaps of small "software" projects that I use then throw away. Doesn't count as a SaaS product but I would still call it software.

bloppe

The article "AI can code, but it can't build software"

An inevitable comment: "But I've seen AI code! So it must be able to build software"

pil0u

I agree that tooling is maturing towards that end.

I wonder if that same non-technical person who built the MVP with GenAI and requires (human) technical assistance today will need it tomorrow as well. Will the tooling be mature enough and lower the barrier enough for anyone to have a complete understanding of software engineering (monitoring services, test coverage, product analytics)?

jahbrewski

I’ve heard “we’re not there today, but it doesn’t seem that far off” since the beginning of the AI infatuation. What if it is far off?

bloppe

It's telling to me that nobody who actually works in AI research thinks that it's "not that far off".

zeckalpha

I think this can be extended (but not necessarily fully mitigated) by working with non-SWE agents interacting with the same codebase. Drafting product requirements, assessing business opportunities, etc. can be done by LLMs.

hamasho

The problem with vibe coding is it demoralizes experienced software engineers. I'm developing an MVP with vibes, and Claude Code and Codex output works in many cases for this relatively new project. But the quality of the code is bad. There is already duplicated or unused logic, and a lot of code is unnecessarily complex (especially React and JSX). And there's little PR review so that "we can keep velocity". I'm paying much less attention to quality now. After all, why bother when AI produces working code? I can't justify and don't have the energy for deep-diving into system design or dozens of nitpicking change requests. And it makes me more and more replaceable by an LLM.

bloppe

> I'm paying much less attention to quality now. After all, why bother when AI produces working code?

I hear this so much. It's almost like people think code quality is unrelated to how well the product works. As though you can have one without the other.

If your code quality is bad, your product will be bad. It may be good enough for a demo right now, but that doesn't mean it really "works".

krackers

Because there's a notion that if any bugs are discovered later on, they can just "be fixed". And generally unless you're the one fixing the bugs, it's hard to understand the asymmetry in effort here. Also, no one ever got any credit for bug fixes compared to adding features.

carlosjobim

> If your code quality is bad, your product will be bad.

Why? Modern hardware power allows for extremely inefficient code, so even if some code runs a thousand times slower because it's badly programmed it will still be so fast that it seems instant.

For the rest of the stuff, it has no relevance for the user of the software what the code is doing inside of the chip, as long as the inputs and outputs function as they should. User wants to give input and receive output, nothing else has any significance at all for her.

bradfa

The context windows are still dramatically too small and the models don't yet seem to be trained on how to build maintainable software. There is a lot less written down about how to do this on the public web. There’s a bunch of high-level public writing but not many great examples of real-world situations that happen on every proprietary software project, because that’s very messy data locked away internal to companies.

I’m sure it’ll improve over time but it won’t be nearly as easy as making AI good at coding.

AnimalMuppet

In fairness, there's a lot more "software" than there is "maintainable software" in their training data...

CMCDragonkai

Many human devs can code, but few can build software.

ergocoder

Yeah, just like many software engineers. AI has achieved software engineering.