
My LLM codegen workflow

161 comments

February 18, 2025

briga

Absolutely, LLMs are great for greenfield projects. They can get you to a prototype for a new idea faster than any tool yet invented. Where they start to break down, I find, is when you ask them to make changes/refactors to existing code and mature projects. They usually lack context, so they don't hesitate to introduce lots of extra complexity, add frameworks you don't need, and in general just make the situation worse. Or if they do get you to some solution, it will have taken so long that you might as well have just done the heavy lifting yourself. LLMs are still no substitute for actually understanding your code.

wilkystyle

100% agree.

My experience to date across the major LLMs is that they are quick to leap to complex solutions, and I find that the code often is much harder to maintain than if I were to do it myself.

But complex code is only part of the problem. Another huge problem I see is the rapid accumulation of technical debt. LLMs will confidently generate massive amounts of code with abstractions and design patterns that may be a good fit in isolation, but are absolutely the wrong pattern for the problem you're trying to solve or the system you're trying to build. You run into the "existing code pattern" problem that Sandi Metz talked about in her fantastic 2014 RailsConf talk, "All the little things" [0]:

> "We have a bargain to follow the pattern, and if the pattern is a good one then the code gets better. If the pattern is a bad one, then we exacerbate the problem."

Rapidly generating massive amounts of code with the wrong abstractions and design patterns is insidious because it feels like incredible productivity. You see it all the time in posts on e.g. Twitter or LinkedIn. People gushing about how quickly they are shipping products with minimal to zero other humans involved. But there is no shortcut to understanding or maintainability if you care about building sustainable software for the medium to long-term.

EDIT: Forgot to add link

[0] https://www.youtube.com/watch?v=8bZh5LMaSmE&t=8m11s

williamcotton

But why follow the wrong abstraction and why try to build something that you don't fundamentally understand?

I've built some rather complex systems:

Guish, a bi-directional CLI/GUI for constructing and executing Unix pipelines: https://github.com/williamcotton/guish

WebDSL, fast C-based pipeline-driven DSL for building web apps with SQL, Lua and jq: https://github.com/williamcotton/webdsl

Search Input Query, a search input query parser and React component: https://github.com/williamcotton/search-input-query

jdlshore

I'm not trying to throw shade when I say this: those codebases are very small. (I'm assuming what I found in the src/ directories is their code.) Working in large codebases is a different kind of experience than working in a small codebase. It's no longer possible to keep the whole system in mind, to keep the dozens+ people working on it in sync, or keep up to date with all the changes being made. In that environment, consistency is a useful mechanism to keep things under control, although it can be overused.

jrvarela56

I would recommend that everyone reading this think of it as a skill issue. You can learn to use the LLM/agent to document your code base, test isolated components, and refactor your spaghetti into modular chunks easily understandable by the agent.

The greenfield projects turn into a mess very quickly because if you let it code without any guidance (wrt documentation, interactivity, testability, modularity) it generates crap until you can't modify it. The greenfield project turns into legacy as fast as the agent can spit out new code.

ukuina

> turns into legacy as fast as the agent can spit out new code.

This is an important point. Unconstrained code generation lets you witness accelerated codebase aging in real-time.

HPsquared

LLM coding is quite well-suited to projects that apply the Unix philosophy. Highly modular systems that can be broken into small components that do one thing well.

StableAlkyne

I've found modularity to be helpful in general whether there's an LLM or not.

Easier to test, lower cognitive overload, and it's faster to onboard someone when they only need to understand a small part at a time.

I almost wonder if these LLMs can be used to assess the barrier to onboarding. If one gets confused and starts generating shitty suggestions, I wonder if that could be a good informal smoke alarm for trouble areas the next junior will run into.

infecto

You are right, but that's also a good indication that the codebase itself is too complex: at a certain size/scale it's too much for a human to reason over, or even where a human could, it's not efficient to do so.

You should not be structuring the code for an LLM alone, but I have found that trying to be very modular has helped both my code and my ability to utilize an LLM on top of it.

sejje

You can use LLMs and actually understand your code.

y1n0

I agree, but where I run into problems is my existing projects are large. In the last couple weeks I’ve had two cases where I really wanted AI help but I couldn’t fit my stuff in the 128k context window.

These are big legacy projects where I didn’t write the code to begin with, so having an AI partner would have been really nice.

jrexilius

The first part of this, where you told it to ask YOU questions, rather than laboriously building prompts and context yourself was the magic ticket for me. And I doubt I would have stumbled on that sorta inverse logic on my own. Really great write up!

danphilibin

This is the key to a lot of my workflows as well. I'll usually tack some form of "ask me up to 5 questions to improve your understanding of what I'm trying to do here" onto the end of my initial messages. Over time I've noticed patterns in information I tend to leave out which has helped me improve my initial prompts, plus it often gets me thinking about aspects I hadn't considered yet.
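As a rough illustration (the wording here is my own, not a quote from anyone's actual prompts), that kind of tail might look like:

    Before writing any code, ask me up to 5 clarifying questions about
    requirements, constraints, and edge cases. Once I've answered, summarize
    your understanding and propose a plan before implementing anything.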

daxfohl

Frankly getting used to doing this may help our communication with other engineers as well.

fragmede

promo from L5->L7 confirmed.

treetalker

Indeed!

The example prompts are useful. They not only reduced the activation energy required for me to start installing this habit in my personal workflows, but also inspired the notion that I can build a library of good prompts and easily implement them by turning them into TextExpander snippets.

P.S.: Extra credit for the Insane Clown Posse reference!

nijaru

I add something like “ask me any clarifying questions” to my initial prompts. For larger requests, it seems to start a dialogue of refinement before providing solutions.

theturtle32

Can confirm, this is an excellent tactic when working with LLMs!

CamperBob2

That's one of the key wins with o1-pro's deep research feature. The first thing it tends to do after you send a new prompt is ask you several questions, and they tend to be good ones.

One idea I really like here is asking the model to generate a todo list.

bcoates

That lonely/downtime section at the end is a giant red flag for me.

It looks like the sort of nonproductive yak-shaving you do when you're stuck or avoiding an unpleasant task--coasting, fooling around incrementally with your LLM because your project's fucked and you psychologically need some sense of progress.

The opposite of this is burnout--one of the things they don't tell you about successful projects with good tools is they induce much more burnout than doomed projects. There's a sort of Amdahl's Law in effect, where all the tooling just gives you more time to focus on the actual fundamentals of the product/project/problem you’re trying to address, which is stressful and mentally taxing even when it works.

Fucking around with LLM coding tools, otoh, is very fun, and like constantly clean-rebuilding your whole (doomed) project, gives you both some downtime and a sense of forward momentum--look how much the computer is chugging!

The reality testing to see if the tool is really helping is to sit down with a concrete goal and a (near) hard deadline. Every time I've tried to use an LLM under these conditions it just fails catastrophically--I don't just get stuck, I realize how basically every implicit decision embedded in the LLM output has an unacceptably high likelihood of being wrong, and I have an amount of debug cycles ahead of me exceeding the time to throw it all away and do it without the LLM by, like, an order of magnitude.

I'm not an LLM-coding hater and I've been doing AI stuff that's worked for decades, but current offerings I've tried aren't even close to productive compared to searching for code that already exists on the web.

khqc

I guess you're not a big fan of rubber duck debugging then? Whenever I get stuck I like to ask myself a bunch of questions and thought experiments to get a better understanding of the problem/project, and with LLMs I'm forced to spell out each one of these questions/experiments coherently, which ends up being great documentation later on. I think LLMs are great if you're actually interested in the fundamentals of your problem/project, otherwise it just turns into a sinkhole that sucks you in.

getnormality

It sounds like LLMs are the new futzing with emacs configuration.

wilkystyle

Old and busted: Futzing around with my Emacs configuration. New hotness: Having an LLM do it for me.

krupan

Seriously!! Coding with LLMs is marketed as a huge time saver, but every time I've tried, it hasn't been. I'm told I just need to put in the time (ironic, no?) to learn to use the LLM properly. Why don't I just use that time to learn to write code better myself?

anon7000

It’s not really ironic. You could spend a couple of hours making yourself twice as good at using AI tools, or a couple of hours making yourself like 0.1% of a better programmer, assuming you’re not banging your head against the wall anyways.

It’s one of those things where a little upskilling can make a big impact. So many things in life need a bit of practice before they’re useful to you.

For starters, you need to change the default prompt in your editor to make it do what you want. If it does something annoying or weird, put it in the prompt to not take that approach. For me, that was absurdly long, useless explanations. And now it’s short and sweet.

mdrzn

Seriously!! Cars are marketed as a huge time saver, but every time I’ve tried one, they haven’t been. I’m told I just need to put in the time (ironic, no?) to learn to drive properly. Why don’t I just use that time to train my legs and run faster instead?

krupan

I think the difference here is that it is not at all obvious to me that an LLM is a force multiplier on the same order as cars to legs.

Cars are pretty easy to observe in action doing what they promise to do. Driving a car is a very straightforward, mechanical, repeatable, intuitive operation.

Working with an LLM is not repeatable or straightforward.

In short, your analogy is not helping me.

triyambakam

It's more like waiting for the code to compile (or node_modules to install before npm improved)

flir

> constantly clean-rebuilding your whole (doomed) project, gives you both some downtime and a sense of forward momentum

ouch. You've thought about this, haven't you? Your ideas are intriguing to me, and I wish to subscribe to your newsletter.

rotcev

This is the first article I’ve come across that truly utilizes LLMs in a workflow the right way. I appreciate the time and effort the author put into breaking this down.

I believe most people who struggle to be productive with language models simply haven’t put in the necessary practice to communicate effectively with AI. The issue isn’t with the intelligence of the models—it’s that humans are still learning how to use this tool properly. It’s clear that the author has spent time mastering the art of communicating with LLMs. Many of the conclusions in this post feel obvious once you’ve developed an understanding of how these models "think" and how to work within their constraints.

I’m a huge fan of the workflow described here, and I’ll definitely be looking into Aider and repomix. I’ve had a lot of success using a similar approach with Cursor in Composer Agent mode, where Claude-3.5-sonnet acts as my "code implementer." I strategize with larger reasoning models (like o1-pro, o3-mini-high, etc.) and delegate execution to Claude, which excels at making inline code edits. While it’s not perfect, the time savings far outweigh the effort required to review an "AI Pull Request."

Maximizing efficiency in this kind of workflow requires a few key things:

- High typing speed – Minimizing time spent writing prompts means maximizing time generating useful code.

- A strong intuition for "what’s right" vs. "what’s wrong" – This will probably become less relevant as models improve, but for now, good judgment is crucial.

- Familiarity with each model’s strengths and weaknesses – This only comes with hands-on experience.

Right now, LLMs don’t work flawlessly out of the box for everyone, and I think that’s where a lot of the complaints come from—the "AI haterade" crowd expects perfection without adaptation.

For what it’s worth, I’ve built large-scale production applications using these techniques while writing minimal code by hand myself.

Most of my experience using these workflows has been in the web dev domain, where there's an abundance of training data. That said, I’ve also worked in lower-level programming and language design, so I can understand why some people might not find models up to par in every scenario, particularly in niche domains.

brokencode

> “I appreciate the time and effort the author put into breaking this down.”

Let’s be honest. The author was probably playing cookie clicker while this article was being written.

rd

Has anyone who evolved from a baseline of just using Cursor chat and freestyling to a proper workflow like this got any anecdata to share on noticeable improvements?

Does the time invested in the planning benefit you? Have you noticed fewer hallucinations? Have you saved time overall?

I’d be curious to hear because my current workflow is basically

1. Have idea

2. create-next-app + ShadCN + TailwindUI boilerplate

3. Cursor Composer on agent mode with Superwispr voice transcription

I’m gonna try the author’s workflow regardless, but would love to hear others opinions.

ghuntley

If you steer it and build a stdlib, you get better outcomes. See https://ghuntley.com/stdlib

MarkMarine

I’ve been following this; my workflow doesn’t use Cursor (VS Code descendants just aren’t my preference) but I’ve built your advice into my homemade system using emacs and gptel. I keep a style guide that is super detailed for each language and project, and now I’ve been building the stdlib you recommended. It’s great, thanks for writing this!

ghuntley

no problem <3

fragmede

> I'm hesitant to give this advice away for free

With all the layoffs in our sector, I wouldn't blame you if you didn't share it, so thank you for sharing.

margalabargala

Yeah, don't they know how to hustle? I bet they're still asleep at 5am.

Seriously, though, it's really sad that not trying to profit off a discussion of industry tooling is something someone has to "push through".

risyachka

How does it help and not make it worse when it comes to layoffs?

e12e

Looks like 70% of those rules would benefit from being shared, just like dot files and editor configs.

mike_hearn

Aider + AI generated maps and user guides for internal modules has worked well for me. Just today I did my own version of a script that uses Gemini 2 Flash (1M context window) to generate maps of each module in my codebase, i.e. a short one or two sentence description of what's in every file. Aider's repo maps don't work well for me, so I disable them, and I think this will work better.

I also have a scratchpad file that I tell the model it can update to reflect anything new it learns, so that gives it a crude form of memory as it works on the codebase. This does help it use internal utility APIs.

manmal

LLMs forcing us to improve our documentation habits. Seriously though, many languages allow API doc generation out of comments. Maybe these docs can just be flattened into a file.

mike_hearn

Yes, sort of. This particular codebase is a mix of Java and Kotlin, and all my internal code has been documented with proper Javadocs/KDocs for years, just for myself and other people I work with. That's partly why Gemini can make such accurate maps.

The problem isn't a lack of docs but rather birds-eye context: even with models that allow huge context windows and are fast, you can drown a model in irrelevant stuff and it's expensive. I'm still with Claude 3.5 for coding and its window is large but not unlimited. You really don't want to add a bunch of source files _and_ the complete API docs for tens of thousands of classes into every prompt, not unless you like waiting whilst money burns and getting problems due to the model getting distracted.

It's also just wasteful, docs contain a lot of redundancy and stuff the model can guess. If you ask models to make notes about only the surprising stuff, it's a form of compression that lets you make smaller prompts.

Aider provides a quick fix because it's easy to control what files are in the context. But to 'level up' I need to let the AI find and add files itself. Aider can do this: it gives the model tools for requesting files to be added to the chat. And in theory, Aider computes a PageRank over symbols and symbolic references to find the most important stuff in the repository and computes a map that's prepended to the prompt so the model knows what to ask for. In practice for reasons I don't understand, Aider's repo maps in this project are full of random useless stuff. Maybe it works better for Python.

Finding the right way to digest codebases is still an open problem. I haven't tried RAG, for instance. If things are well abstracted, in theory it shouldn't be needed.

throwup238

Indexing automatically generated API docs using Cursor seems to work very well. I also index any guides/mdbooks libraries have available, depending on whether I’m trying to implement something new or modifying existing code.

orsenthil

> Aider + AI generated maps and user guides

How do you do that? Especially the AI generated map?

mike_hearn

I have a custom script. It selects all the source files, strips any license headers and concatenates them like this:

    <source_file name="foo/bar/Baz.java">
    ...
    </source_file>
It then chunks them to fit within model context window limits, sends it to the LLM with a system prompt that asks it to summarize each file in a compact way, and writes the result back out to the tree.

The ugly XML tag is to avoid conflicts. Some other scripts try to make a Markdown document of the tree which is silly because your tree is quite likely to contain Markdown already, and so it's confusing for the model to see ``` that doesn't really terminate the block. Using a marker pattern that's unlikely to occur in your code fixes that.
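For anyone curious, a minimal sketch of that kind of map-generation script might look like the following. This is my own illustration under stated assumptions, not the parent's actual script: the file-extension filter, the chunk size, the output path, and the call_llm(system, user) helper are hypothetical placeholders for whatever LLM client you use.

    import os

    SYSTEM_PROMPT = (
        "For each <source_file> below, write a one or two sentence summary "
        "of what the file contains, in a compact map format."
    )
    MAX_CHARS = 400_000  # crude stand-in for the model's context window limit

    def gather_sources(root):
        # Walk the tree and wrap each source file in an unambiguous marker,
        # so code fences inside files can't confuse the model.
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                if name.endswith((".java", ".kt")):  # hypothetical filter
                    path = os.path.join(dirpath, name)
                    with open(path, encoding="utf-8") as f:
                        body = f.read()
                    yield f'<source_file name="{path}">\n{body}\n</source_file>'

    def chunks(blocks, limit=MAX_CHARS):
        # Pack wrapped files into batches that fit within the context window.
        batch, size = [], 0
        for block in blocks:
            if batch and size + len(block) > limit:
                yield "\n".join(batch)
                batch, size = [], 0
            batch.append(block)
            size += len(block)
        if batch:
            yield "\n".join(batch)

    def build_map(root, call_llm, out_path="CODEBASE_MAP.md"):
        # call_llm(system_prompt, user_text) -> str is a placeholder for
        # your LLM client (e.g. a large-context model like Gemini Flash).
        summaries = [call_llm(SYSTEM_PROMPT, piece)
                     for piece in chunks(gather_sources(root))]
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n\n".join(summaries))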

dimitri-vs

Yes, and then I keep going back to the basics:

- small .cursorrules file explaining what I am trying to build and why at a very high level and my tech stack

- a DEVELOPMENT.md file which is just a to-do/issue list for me that I tell cursor to update before every commit

- a temp/ directory where I dump contextual md and txt files (chat logs discussing feature, more detailed issue specs, etc.)

- a separate snippet management app that has my commonly used request snippets (write commit message, ask me clarifying questions, update README, summarize chat for new session, etc.)

Otherwise it's pretty much what your workflow is.
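To make the first item concrete, a small .cursorrules along those lines might read something like this; the contents below are invented purely for illustration, not my actual rules file:

    # What: one or two sentences on what the product is and why it exists
    # Stack: list your frameworks, language versions, and tooling here
    # Style: keep modules small; prefer existing utilities over new dependencies
    # Process: check DEVELOPMENT.md for open to-dos and update it before each commit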

cynicalpeace

I'm wondering the same thing.

Most of these workflows are just context management workflows and in Cursor it's so simple to manage the context.

For large files I just highlight the code and cmd+L. For short files, I just add them all by using /+downarrow

I constantly feed context like this and then usually come to a good solution for both legacy and greenfield features/products.

If I don't come to a good solution it's almost always because I didn't think through my prompt well enough and/or I didn't provide the correct context.

bambax

This is all fine for a solo dev, but how does this work with a team / squad, working on the same code base?

Having 7 different instances of an LLM analyzing the same code base and making suggestions would not just be economically wasteful, it would also be impractical or even dangerous?

Outside of RAG, which is a different thing, are there products that somehow "centralize" the context for a team, where all questions refer to the same codebase?

sambo546

I've started substituting "human" for "LLM" when I read posts like these. Is having 7 different humans analyzing the same code base any less wasteful?

bambax

They are not analyzing the same code base, they are all contributing to the same code base, each in their own domain. It would seem relevant that any advice an LLM gives to one of them is kept consistent -- in real time -- with any other advice to any other dev, instead of having to wait for each commit or push.

staindk

I've only recently switched to Cursor so am not clued up about everything, but they mention that the embedded indexing they do on your code is shared with others (others who have access to that repository? Unsure).

It did seem to take a while to index, even though my colleagues had been using Cursor for a while, so I'm likely misunderstanding something.

rollinDyno

Something I quickly learned while retooling this past week is that it’s preferable not to add opinionated frameworks to the project as they increase the size of the context the model should be aware of. This context will also not likely be available in the training data.

For example, rather than using Plasmo for its browser extension boilerplate and packaging utilities, I’ve chosen to ask the LLM to setup all of that for me as it won’t have any blindspots when tasked with debugging.

sampton

The end of artisan frameworks - probably for the better.

balls187

It's likely the end of a lot of abstractions that made programming easier.

At some point, specialized code-gen transformer models should get really good at just spitting out the lowest level code required to perform the job.

yoz

Disagree. Some abstractions are still vital, and it's for the same reasons as always: communicate purpose and complexity concisely rather than hiding it.

The best code is that which explains itself most efficiently and readably to Whoever Reads It Next. That's even more important with LLMs than with humans, because the LLMs probably have far less context than the humans do.

Developers often fall back on standard abstraction patterns that don't have good semantic fit with the real intent. Right now, LLMs are mostly copying those bad habits. But there's so much potential here for future AI to be great at creating and using the right abstractions as part of software that explains itself.

hy4000days

This.

Future programming language designers are then answering questions like:

"How low-level can this language be while considering generally available models and hardware available can only generate so many tokens per second?",

"Do we have the language models generate binary code directly, or is it still more efficient time-wise to generate higher level code and use a compiler?"

"Do we ship this language with both a compiler and language model?"

"Do we forsake code readability to improve model efficiency?"

bee_rider

Surely no respectable professional would just ship code they don’t understand, right? So the LLM should probably spit out code in reasonably well known languages using reasonably well known libraries and other abstractions…

fastball

It's not just frameworks – I noticed this recently when starting a new project and utilizing EdgeDB. They have their own Typescript query builder, and [insert LLM] cannot write correct constructions with that query builder to save its life.

tarkin2

Most new programmers forget the specification and execution plan part of programming.

I ended up finishing my side projects when I kept these in mind, rather than focusing on elegant code for elegant code's sake.

It seems the key to using LLMs successfully is to make them create a specification and execution plan, through making them ask /you/ questions.

If this skill--specification and execution planning--is passed onto LLMs, along with coding, then are we essentially souped-up tester-analysts?

fullstackwife

Looks similar to my experience, except this part:

> if it doesn’t work, Q&A with aider to fix

I fix errors myself, because LLMs are capable of producing large chunks of really stupid/wrong code, which needs to be reverted, and that's why it makes sense to see the code at least once.

Also, I used to find myself in situations where I tried to use an LLM for the sake of using an LLM to write code (a waste of time).

codeisawesome

Would be great if there were more details on the costs of doing this work, especially when loading lots of tokens of context via repomix and then generating code with that context (context-loaded inference API calls are more expensive, correct?). A dedicated post discussing this and related considerations would be even better. Are there cost estimations in tools like Aider (vs just refreshing the LLM platform’s billing dashboard)?

Isamu

I’m curious, is adding “do not hallucinate” to prompts effective in preventing hallucinations? The author does this.

watt

It will work - you can see it well with a Chain of Thought (CoT) model: it will keep asking itself: "am I hallucinating? let's double check" and then will self-reject thoughts if it can't find a proper grounding. In fact, this is the best part of CoT model, that you can see where it goes off rails and can add a message to fix it in the prompt.

For example, there is the common challenge, "count how many r letters are in strawberry", and you can see the issue is not counting, but that the model does not know whether "rr" should be treated as a single "r": it is not sure if you are counting r "letters" or r "sounds", and when you sound out the word, there is a single "r" sound where it is spelled with a double "r". So if you tell the model that a double "r" counts as 2 letters, it will get it right.

simonw

Apple were using that in their Apple Intelligence system prompts last year, I don't know if they still have that in there. https://simonwillison.net/2024/Aug/6/apple-intelligence-prom...

I have no idea if it works or not!

harper

I added it because of the Apple prompts! I figured it was worth a try. The results are good, but I did not test it extensively.

becquerel

I don't know about this specific technique, but I have found it useful to add a line like 'it's OK if you don't know or this isn't possible' at the end of queries. Otherwise LLMs have a tendency to tilt at whatever windmill you give them. Managing tone and expectations with them is a subtle but important art.

krainboltgreene

It seems absurd, but I suppose it’s the same as misspelling with similar enough trigrams as to get the best autocorrect results.

ggulati

Nice, I coincidentally wrote a blog post today exploring workflows as well: https://ggulati.wordpress.com/2025/02/17/cursorai-for-fronte...

Your workflow is much more polished, will definitely try it out for my next project

fragmede

> paste in prompt into claude copy and paste code from claude.ai into IDE

is more polished? What's your workflow, banging rocks together?

ggulati

More or less, I tried out Cursor for the first time a week ago. So very much in the newbie stage and looking to learn

shoemakersteve

This made me laugh audibly. Thank you.

harper

let me know how it works!

hnuser123456

Looks like your blog crashed, I've been wanting to read it

hnuser123456

Thank you for fixing it