
I Read All of Cloudflare's Claude-Generated Commits

SrslyJosh

> Reading through these commits sparked an idea: what if we treated prompts as the actual source code? Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.

> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.

How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?

gizmo686

My work has involved a project that is almost entirely generated code for over a decade. Not AI generated, the actual work of the project is in creating the code generator.

One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.

Another thing we learned very quickly was that attempting to generate code, then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.

Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.
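
To make those last two points concrete, here is a minimal sketch (Python, with hypothetical names that are not from the commenter's actual project) of a generator that splices hand-written code in at well-known injection points and gates a behavior change behind a feature flag:

    # Illustrative sketch only; the injection-point and flag names are made up.
    MANUAL_OVERRIDES = {
        # hand-written code lives outside the generated output and is
        # reviewed separately from the generated code
        "User.validate": "if not self.email: raise ValueError('email required')",
    }

    def generate_class(name, fields, *, use_slots=False):
        """Emit a class. Behavior changes in the generator (here, emitting
        __slots__) are gated behind flags so consumers relying on the old
        output are not broken."""
        lines = [f"class {name}:"]
        if use_slots:
            # new generator behavior, behind a flag
            lines.append(f"    __slots__ = {tuple(fields)!r}")
        lines.append(f"    def __init__(self, {', '.join(fields)}):")
        for f in fields:
            lines.append(f"        self.{f} = {f}")
        # injection point: splice manually crafted code into the generated body
        lines.append("    def validate(self):")
        override = MANUAL_OVERRIDES.get(f"{name}.validate")
        lines.append(f"        {override}" if override else "        pass")
        return "\n".join(lines)

    print(generate_class("User", ["email", "name"]))                 # old behavior
    print(generate_class("User", ["email", "name"], use_slots=True)) # flagged behavior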

saagarjha

I think the biggest difference here is that your code generator is probably deterministic and you likely are able to debug the results it produces rather than treating it like a black box.

mschild

> One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable.

Keeping a repository with the prompts, or other commands separate is fine, but not committing the generated code at all I find questionable at best.

diggan

If you can 100% reproduce the same generated code from the same prompts, even 5 years later, given the same versions and everything, then I'd say "Sure, go ahead and don't save the generated code, we can always regenerate it". As someone who spent some time in frontend development: we've been doing it like that for a long time with (MB+) generated code, and keeping it in scm just isn't feasible long-term.

But given this is about LLMs, which people tend to run with temperature>0, this is unlikely to be true, so I'd really urge anyone to actually store the results (somewhere, maybe not in scm specifically); otherwise you won't have any idea what the code was in the future.
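
For the "store the results somewhere" part, a minimal sketch (hypothetical paths and helper name) would be to key each generation by a hash of the prompt plus model and sampling settings, so the exact artifact can be recovered later even if it never goes into scm:

    import hashlib, json, pathlib

    def archive_generation(prompt: str, model: str, params: dict, output: str,
                           root: str = "generation-archive") -> str:
        """Store generated code keyed by (prompt, model, params) so you can
        always look up exactly what was produced, even without committing it."""
        meta = {"prompt": prompt, "model": model, "params": params}
        key = hashlib.sha256(json.dumps(meta, sort_keys=True).encode()).hexdigest()
        path = pathlib.Path(root) / key
        path.mkdir(parents=True, exist_ok=True)
        (path / "meta.json").write_text(json.dumps(meta, indent=2))
        (path / "output.txt").write_text(output)
        return key  # record this key in scm next to the prompt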

djtango

I didn't read it as that - If I understood correctly, generated code must be quarantined very tightly. And inevitably you need to edit/override generated code and the manner by which you alter it must go through some kind of process so the alteration is auditable and can again be clearly distinguished from generated code.

Tbh this all sounds very familiar and like classic data management/admin systems for regular businesses. The only difference is that the data is code and the admins are the engineers themselves so the temptation to "just" change things in place is too great. But I suspect it doesn't scale and is hard to manage etc.

saagarjha

I feel like using a compiler is, in a sense, using a code generator where you don't commit the actual output.

cimi_

I will guess that you are generating orders of magnitude more lines of code with your software than people do when building projects with LLMs - if this is true I don't think the analogy holds.

skywhopper

There’s a huge difference between deterministic generated code and LLM generated code. The latter will be different every time, sometimes significantly so. Subsequent prompts would almost immediately be useless. “You did X, but we want Y” would just blow up if the next time through the LLM (or the new model you’re trying) doesn’t produce X at all.

Xelbair

Worse: models aren't deterministic! They use a temperature value to control randomness, just so they can escape local minima!

Regenerated code might behave differently, have different bugs (worst case), or not work at all (best case).

chrishare

Nitpick - it's the ML system that is sampling from model predictions that has a temperature parameter, not the model itself. Temperature and even model aside, there are other sources of randomness like the underlying hardware that can cause the havoc you describe.

rectang

>> what if we treated prompts as the actual source code?

You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.

squillion

Exactly!

> this assumes models can achieve strict prompt adherence

What does strict adherence to an ambiguous prompt even mean? It’s like those people asking Babbage if his machine would give the right answer when given the wrong figures. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a proposition.

a012

Prompts are like stories on the board, and just as with engineers, the generated source code can vary depending on the model's understanding. Saying the prompts could be the actual code is a wrong and dangerous thought.

null

[deleted]

fastball

The idea as stated is a poor one, but a slight reshuffling and it seems promising:

You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.

Obviously it's not actually that simple, as tests don't catch everything (tho with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
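
A rough sketch of that regenerate-and-test loop, with hypothetical names (it assumes you already have some generate_codebase wrapper around the agent, and that the existing test suite is the acceptance gate):

    import subprocess

    def regenerate_and_verify(prompt_chain, extra_instruction, generate_codebase, workdir):
        """Replay the saved prompt chain against a newer model, then accept the
        regenerated codebase only if the existing test suite still passes."""
        # generate_codebase is whatever wrapper you have around the agent;
        # it is assumed to write the regenerated code into workdir.
        generate_codebase(prompt_chain + [extra_instruction], workdir)
        result = subprocess.run(["pytest", "-q"], cwd=workdir)
        return result.returncode == 0

    # e.g.:
    # regenerate_and_verify(saved_chain,
    #                       "create this project, focusing on readability of the code",
    #                       my_agent, "regen/")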

stingraycharles

That means the temperature would have to be set to 0 (which not every provider supports) so that the output becomes entirely deterministic. Right now, with most models, if you give the same input prompt twice you will get two different solutions.

NitpickLawyer

Even at temp 0, you might get different answers, depending on your inference engine. There might be hardware differences, as well as software issues (e.g. vLLM documents this, if you're using batching, you might get different answers depending on where in the batch sequence your query landed).

weird-eye-issue

Claude Code already uses a temperature of 0 (just inspect the requests) but it's not deterministic

Not to mention it also performs web searches, web fetching etc which would also make it not deterministic

derwiki

Two years ago when I was working on this at a startup, setting OAI models’ temp to 0 still didn’t make them deterministic. Has that changed?

afiori

Do LLM inference engines have a way to seed their randomness, so as to have reproducible outputs while still allowing some variance if desired?
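
Some providers do expose this: for example, OpenAI's chat completions API accepts a seed parameter alongside temperature, though it's documented as best-effort rather than guaranteed determinism. A minimal sketch:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Write a function that parses RFC 3339 timestamps."}],
        temperature=0,  # minimize sampling randomness
        seed=12345,     # best-effort reproducibility, not a hard guarantee
    )
    # system_fingerprint identifies the backend configuration; if it changes
    # between runs, outputs may differ even with the same seed.
    print(resp.system_fingerprint)
    print(resp.choices[0].message.content)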

fastball

I would only care about more deterministic output if I was repeating the same process with the same model, which is not the point of the exercise.

pollinations

I'd say commit a comprehensive testing system with the prompts.

Prompts are in a sense what higher level programming languages were to assembly. Sure there is a crucial difference which is reproducibility. I could try and write down my thoughts why I think in the long run it won't be so problematic. I could be wrong of course.

I run https://pollinations.ai which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. For about a year there has been no significant human commit. You can check the codebase. It's messy, but not more messy than my codebases were pre-LLMs.

I think prompts + tests in code will be the medium-term solution. Humans will be spending more time testing different architecture ideas and be involved in reviewing and larger changes that involve significant changes to the tests.

visarga

The idea is good, but we should commit both documentation and tests. They allow regenerating the code at will.

never_inline

Apart from obvious non-reproducibility, the other problem is lack of navigable structure. I can't command+click or "show usages" or "show definition" any more.

saagarjha

Just ask the AI for those obviously

7speter

I think the author is saying you commit the prompt with the resulting code. You said it yourself, storage is free, so include the prompt as a comment along with the output (not in place of it, if I'm not being clear); it would show the developer's intent and, to some degree, almost always contribute to the documentation process.
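
In its simplest form that could look something like this (illustrative only; the prompt and code are made up):

    # Generated with an LLM; prompt preserved to record intent:
    #
    #   "Write a helper that redacts email addresses from log lines
    #    before they are written to disk."
    #
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def redact_emails(line: str) -> str:
        return EMAIL_RE.sub("[redacted-email]", line)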

declan_roberts

These posts are funny to me because prompt engineers point at them as evidence of fast-approaching software engineer obsolescence, but the amount of software engineering experience necessary to even guide an AI in this way is very high.

The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.

It just blurs the line between engineer and tool.

latexr

> It just blurs the line between engineer and tool.

I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.

spaceman_2020

The argument is that this stuff will so radically improve senior engineer productivity that the demand for junior engineers will crater. And without a pipeline of junior engineers, the junior-to-senior trajectory will radically atrophy

Essentially, the field will get frozen where existing senior engineers will be able to utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment

I don’t think anything in this article counters this argument

tptacek

I don't know why people don't give more credence to the argument that the exact opposite thing will happen.

dcre

Right. I don’t understand why everyone thinks this will make it impossible for junior devs to learn. The people I had around to answer my questions when I was learning knew a whole lot less than Claude and also had full time jobs doing something other than answering my questions.

spaceman_2020

If a junior engineer ships a similar repo to this with the help of AI, sure, I'll buy that.

But as of now, it's senior engineers who really know what they're doing who can spot the errors in AI code.

visarga

> prompt engineers point at them as evidence of the fast-approaching software engineer obsolescence

Maybe journalists and bloggers angling for attention do it, prompt engineers are too aware of the limitations of prompting to do that.

tptacek

I don't know why that's funny. This is not a post about a vibe coding session. It's Kenton Varda['s coding session].

later

updated to clarify kentonv didn't write this article

kiitos

The sequence of commits talked about by the OP -- i.e. kenton's coding session's commits -- are like one degree removed from 100% pure vibe coding.

tptacek

Your claim here being that Kenton Varda isn't reading the code he's generating. Got it. Good note.

kevingadd

I think it makes sense that GP is skeptical of this article considering it contains things like:

> this tool is improving itself, learning from every interaction

which seem to indicate a fundamental misunderstanding of how modern LLMs work: the 'improving' happens by humans training/refining existing models offline to create new models, and the 'learning' is just filling the context window with more stuff, not enhancement of the actual model or the model 'learning' - it will forget everything if you drop the context and as the context grows it can 'forget' things it previously 'learned'.

Fischgericht

So, it means that you and the LLM together have managed to write SEVEN lines of trivial code per hour. On a protocol that is perfectly documented, where you can look at about one million other implementations when in doubt.

It is not my intention to hurt your feelings, but it sounds like you and/or the LLM are not really good at their job. Looking at programmer salaries and LLM energy costs, this appears to be a very very VERY expensive OAuth library.

Again: Not my intention to hurt any feelings, but the numbers really are shockingly bad.

Fischgericht

Yes, my brain got confused on who wrote the code and who just reported about it. I am truly sorry. I will go see my LLM doctor to get my brain repaired.

eviks

> Imagine version control systems where you commit the prompts used to generate features rather than the resulting implementation.

So every single run will result in a different, non-reproducible implementation with unique bugs requiring manual expert intervention. How is this better?

thorum

Humorous that this article has a strong AI writing smell - the author should publish the prompts they used!

dcre

I don’t like to accuse, and the article is fine overall, but this stinks: “This transparency transforms git history from a record of changes into a record of intent, creating a new form of documentation that bridges human reasoning and machine implementation.”

ZephyrBlu

Also: "This OAuth library represents something larger than a technical milestone—it's evidence of a new creative dynamic emerging"

Em-dash baby.

ZeroTalent

I have used Em-dashes in many of my comments for years. It's just a result of reading books, where Em-dashes happen a lot.

latexr

Can we please stop using the em-dash as a metric to “detect” LLM writing? It’s lazy and wrong. Plenty of people use em-dashes, it’s a useful punctuation mark. If humans didn’t use them, they wouldn’t be in the LLM training data.

There are better clues, like the kind of vague pretentious babble bad marketers use to make their products and ideas seem more profound than they are. It’s a type of bad writing which looks grandiose but is ultimately meaningless and that LLMs heavily pick up on.

keybored

> I don’t like to accuse, and the article is fine overall, but this stinks:

Now consider your reasonable instinct not to accuse other people, coupled with the possibility of setting AI loose with “write a positive article about AI where you have some paragraphs about the current limitations based on this link. write like you are just following the evidence.” Meanwhile we are supposed to sit here and weigh every word.

This reminds me to write a prompt for a blogpost: how AI could be used for making personal-looking "tech guy who meditates and runs" websites. (Do we have the technology? Yes we do.)

lmeyerov

If/when to commit prompts has been fascinating as we have been doing similarly to build Louie.ai. I now have several categories with different handling:

- Human reviewed: Code guidelines and prompt templates are essentially dev tool infra-as-code and need review

- Discarded: Individual prompt commands I write, and the implementation-plan progress files the AI writes, both get trashed and are even part of my .gitignore. They were kept by Cloudflare, but we don't keep these.

- Unreviewed: Claude Code does not do RAG in the usual sense, so it is on us to create guides for how we do things like use big frameworks. They are basically indexes for speeding up AI with less grepping + hallucinating across memory compactions. The AI reads and writes these, and we largely stay out of it.

There are weird cases I am still trying to figure out. Ex:

- a feature impl might start with an AI coming up with the product spec, so having that maintained as the AI progresses, and committed, is a potentially useful artifact

- how prompt templates get used is helpful for their automated maintenance.

SupremumLimit

It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?

Dylan16807

Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.

a2128

I think at this point we're reaching more incremental updates, which can score higher on some benchmarks but then simultaneously behave worse with real-world prompts, most especially if they were prompt engineered for a specific model. I recall Google updating their Flash model on their API with no way to revert to the old one and it caused a lot of people to complain that everything they've built is no longer working because the model is just behaving differently than when they wrote all the prompts.

deadbabe

5 years ago a person would be blown away by today’s LLMs. But people today will merely say “cool” at whatever LLMs are in use 5 years from now. Or maybe not even that.

tptacek

For most of the developers I know personally who have been radicalized by coding agents, it happened within the past 9 months. It does not feel like we are in a phase of predictable boring improvement.

dingnuts

5 years ago GPT-2 was already outputting largely coherent speech; there's been progress, but it's not all that shocking.

dwaltrip

Bold prediction…

sitkack

It is copium that it will suddenly stop and the world they knew before will return.

ChatGPT came out in Nov 2022. "Attention Is All You Need" was 2017, so we were already 5 years in the past, with 5 years of research to catch up to; and from 2022 to now, papers and research have been increasing exponentially. Even if SOTA models were frozen, we would still have years of research to apply and optimize in various ways.

null

[deleted]

rxtexit

The copium, I think, is that many people got comfortable post-financial-crisis with nothing much changing or happening. Many people really liked a decade-long stretch with not much more than web framework updates and smartphone versioning.

We are just back on track.

I just read Oracular Programming: A Modular Foundation for Building LLM-Enabled Software the other day.

We don't even have a new paradigm yet. I would be shocked that in 10 years I don't look back at this time of writing a prompt into a chatbot and then pasting the code into an IDE as completely comical.

The most shocking thing to me is we are right back on track to what I would have expected in 2000 for 2025. In 2019 those expectations seemed like science fiction delusions after nothing happening for so long.

BoorishBears

I think it's equally copium that people keep assuming we're just going to compound our way into intelligence that generalizes enough to stop us from handholding the AI, as much as I'd genuinely enjoy that future.

Lately I spend all day post-training models for my product, and I want to say 99% of the research specific to LLMs doesn't reproduce and/or matter once you actually dig in.

We're getting exponentially more papers on the topics and they're getting worse on average.

Every day there's a new paper claiming an X% gain by post-training some ancient 8B parameter model and comparing it to a bunch of other ancient models after they've overfitted on the public dataset of a given benchmark and given the model a best of 5.

And benchmarks won't ever show it, but even ChatGPT 3.5-Turbo has better general world knowledge than a lot of models people consider "frontier" models today, because post-training makes it easy to cover up those gaps with very impressive one-prompt outputs and strong benchmark scores.

-

It feels like things are getting stuck in a local maxima: we are making forward progress, the models are useful and getting more useful, but the future people are envisioning takes reaching a completely different goal post that I'm not at all convinced we're making exponential progress towards.

There may be an exponential number of techniques claiming to be groundbreaking, but what has actually unlocked new capabilities that can't just as easily be attributed to how much more focused post-training has become on coding and math?

Test-time compute feels like the only one, and we're already seeing cracks form in terms of its effect on hallucinations; there's a clear ceiling on the performance the current iteration unlocks, as all these models are converging on pretty similar performance after just a few model releases.

sumedh

More compute means faster processing and more context.

groby_b

It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.

LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.

It's not "techno-utopian determinism". It's a clearly visible trajectory.

Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.

The observation that strict prompt adherence plus prompt archival could shift how we program is true, and it's a phenomenon we've observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.

There's definitely valid criticism of the passage, and it is overly optimistic, in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all of them correct. That's both a more useful criticism and not tied to LLM improvements at all.

double0jimb0

Are there places that follow the research that speak to the layperson?

Sevii

Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.

greyadept

For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.

dymk

All the benchmarks would disagree with you

tptacek

Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates. You don't even see it unless you make the mistake of looking closely.

BoorishBears

This is just people talking past each other.

If you want a model that's getting better at helping you as a tool (which for the record, I do), then you'd say in the last 3 months things got better between Gemini's long context performance, the return of Claude Opus, etc.

But if your goal post is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years even.

In the last year the only real foundational breakthrough would be RL-based reasoning w/ test-time compute delivering real results, but what that does to hallucinations, plus even Deepseek catching up with just a few months of post-training, shows that, in its current form, the technique doesn't blow past the barriers in the way people originally touted it would.

Overall models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet.)

viraptor

The documentation angle is really good. I've noticed it with the mdc files and llm.txt semi-standard. Documentation is often treated as just extra cost and a chore. Now, good description of the project structure and good examples suddenly becomes something devs want ahead of time. Even if the reason is not perfect, I appreciate this shift we'll all benefit from.

cosmok

I have documented my experience using an agent for a slightly different task -- upgrading a framework version. I had to abandon the work, but my learnings have been similar to what is in the post.

https://www.trk7.com/blog/ai-agents-for-coding-promise-vs-re...

kookamamie

> Treat prompts as version-controlled assets

This only works if the model and its context are immutable. None of us really control the models we use, so I'd be sceptical about reproducing the artifacts later.
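
If you do commit prompts anyway, the committed record probably needs to pin everything the output depended on. A minimal sketch of such a manifest (field names and values are hypothetical, not from any real tool):

    # Hypothetical prompt manifest, committed alongside the generated code.
    PROMPT_MANIFEST = {
        "model": "claude-sonnet-4-20250514",   # a dated snapshot, not a floating alias
        "temperature": 0,
        "seed": 12345,
        "system_prompt": "prompts/system.md",  # committed file, not an implicit default
        "prompts": ["prompts/0001-initial-feature.md"],
        "tool": {"name": "claude-code", "version": "<pin the exact version>"},
    }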

moron4hire

I'm sorry, this all sounds like a fucking miserable experience. Like, if this is what my job becomes, I'll probably quit tech completely.

never_inline

> Don't be afraid to get your hands dirty. Some bugs and styling issues are faster to fix manually than to prompt through. Knowing when to intervene is part of the craft.

This has been my experience as well. I've found it best to always run the CLI tool in the bottom pane of an IDE and not in a standalone terminal.