The State of AI Coding Report 2025
60 comments
· December 17, 2025
zkmon
a_imho
My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.
As a dev I very much subscribe to this line of thought, but I also have to admit most of the business class people would disagree.
scuff3d
It shouldn't be taken with a pinch of salt, it should be disregarded entirely. It's an utterly useless metric, and the fact that the report leads with it makes the entire thing suspect.
dakshgupta
How would you measure code quality? Would persistence be a good measure?
scuff3d
That question has been baffling product managers, scrum masters, and C-suite assholes for decades. Along with how you measure engineering productivity.
epicureanideal
Bad code can persist because nobody wants to touch it.
Unfortunately I’m not sure there are good metrics.
hvb2
Agreed, I stopped reading at that point. You can't take yourself seriously to create a report and use LOC as your measure.
I feel like we humans try to separate things and keep things short. We do this not because we think it's pretty, but so our human brains can still reason about a big system. As a result LOC is a bad measure: by that yardstick, being concise hurts your productivity????
dakshgupta
We're careful not to draw any conclusions from LoC. The fact is LoCs are higher, which by itself is interesting. This could be a good or bad thing depending on code quality, which itself varies wildly person-to-person and agent-to-agent.
mrdependable
Can you expand on why it is interesting?
locusofself
This is definitely interesting information and I plan to take a deeper look at it.
What a lot of us must be wondering though is:
- how maintainable is the code being outputted
- how much is this newfound productivity saving (costing) on compute, given that we are definitely seeing more code
- how many livesite/security incidents will be caused by AI generated code that hasn't been reviewed properly
dakshgupta
We weren’t able to agree on a good way to measure this. Curious - what’s your opinion on code churn as a metric? If code simply persists over some number of months, is that indication it’s good quality code?
arcwhite
I've seen code persist a long time because it is unmaintainable gloop that takes forever to understand and nobody is brave enough to rebuild it.
So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe for a given code path or module or class hierarchy, how many calls it makes within itself vs to things outside the hierarchy - some measure of how many files you need to jump between to understand it
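As a rough illustration of the cyclomatic complexity idea above, here is a minimal Python sketch that approximates it per function using the standard `ast` module. The set of node types counted as branches and the `SAMPLE` snippet are assumptions made for illustration; dedicated tools apply more complete rules.

```python
import ast

# Node types treated as one extra decision point each.
# This is an approximation of McCabe's rules, not a full implementation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> dict[str, int]:
    """Return {function_name: approximate cyclomatic complexity}."""
    tree = ast.parse(source)
    results = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1  # one path through a function with no branches
            # Note: branches inside nested functions also count toward the parent.
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # "a and b and c" adds two short-circuit decision points
                    score += len(child.values) - 1
            results[node.name] = score
    return results


# Hypothetical sample input, used only to exercise the function.
SAMPLE = '''
def classify(n):
    if n < 0 and n % 2:
        return "neg-odd"
    for i in range(n):
        if i == 3:
            return "has-three"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))
```

A score like this says nothing about how many files you have to jump between, which is the second half of the suggestion; that would need call-graph information on top.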
wordpad
I've seen code entropy suggested as the heuristic to measure.
simonw
> Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.
Is that a per-year number?
If a year has 200 working days that's still only about 40 lines of code a day.
When I'm in full-blown work mode with a decent coding agent (usually Claude Code) I'm genuinely producing 1,000+ lines of (good, tested, reviewed) code a day.
Maybe there is something to those absurd 10x multiplier claims after all!
(I still think there's plenty of work done by software engineers that isn't crunching out code, much of which isn't accelerated by AI assistance nearly as much. 40 lines of code per day felt about right for me a few years ago.)
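The back-of-the-envelope arithmetic above can be spelled out in a few lines of Python. This is only a sketch: it assumes the report's figures are per-year totals (which the comment itself asks about) and reuses the commenter's 200-working-day assumption and self-reported 1,000 lines a day.

```python
WORKING_DAYS = 200  # assumption taken from the comment, not from the report


def lines_per_day(lines_per_year: float, working_days: int = WORKING_DAYS) -> float:
    """Average lines of code per working day."""
    return lines_per_year / working_days


before = lines_per_day(4450)   # report's earlier figure
after = lines_per_day(7839)    # report's later figure
agent_day = 1000               # the commenter's self-reported daily output

print(f"before: {before:.1f} lines/day")
print(f"after:  {after:.1f} lines/day")
print(f"growth: {7839 / 4450:.2f}x")
print(f"agent day vs. report average: {agent_day / after:.1f}x")
```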
observationist
If you actually work, the amount of work you do is absurdly more than the amount of work most others do, and a lot of the time, both the high and low productivity people assume everyone just does as much as they do, in both directions.
A lot of people are oblivious to Zipf distributions in effort and output, and if you ever catch on to it as a productive person, it really reframes ideas about fairness and policy and good or bad management.
It also means that you can recognize a good team, and when a bunch of high performers are pushing and supporting each other and being held to account out in the open, amazing things happen that just make other workplaces look ridiculous.
My hope for AI is that instead of 20% of the humans doing 80% of the work, you end up with force multipliers, and a ramping up, so that more workplaces look like high-functioning teams, making everything more fair and engaging and productive. But I suspect that once people get better with AI, at least up to the point of AGI, we're going to see the same distribution but at 10x or 50x the productivity.
lumost
There is a long tail of engineers working on mature/stable code bases where there are fewer extremely large diffs, or the review burden is extremely high. If you work on core software - then you can never say that a line of code was wrong "because of the AI." e.g. places where you might need 2-3x code approvers or more.
rnewme
1k loc per day or 1k git additions? I don't think one person can consistently review 1k loc, grow a codebase at that speed and size, and classify it as good, tested and reviewed. Can you tell us more about your process?
simonw
I'm effectively no longer typing code by hand: I decide what change I want to make and then prompt Claude Code to describe that change. Sometimes I'll have it figure out the fix too.
An example from earlier today: https://github.com/simonw/llm-gemini/commit/fa6d147f5cff9ea9...
That commit added 33 lines and removed 13 - so I'm already at a 20-lines-a-day level just from that one commit (and I shipped a few more plus a release of llm-gemini: https://github.com/simonw/llm-gemini/commits/a2bdec13e03ca8a...)
It took about 3.5 minutes. I started from this issue someone had filed against my repo:
Then I opened Claude Code and said:
Run this command: uv run llm -m gemma-3-27b-it hi
That ran the command and returned the error message. I then said:
Yes, fix that - the gemma models do not support media resolution
Which was enough for it to figure out the fix and run the tests to confirm it hadn't broken anything.
I ran "git diff", thought about the change it had made for a moment, then committed and pushed it.
Here's the full Claude Code transcript: https://gistpreview.github.io/?62d090551ff26676dfbe54d8eebbc...
I verified the fix myself by running:
uv run llm -m gemma-3-27b-it hi
I pasted the result into an issue comment to prove to myself (and anyone else who cares) that I had manually verified the fix: https://github.com/simonw/llm-gemini/issues/116#issuecomment...
Here's a more detailed version of the transcript including timestamps, showing my first prompt at 10:01:13am and the final response at 10:04:55am: https://tools.simonwillison.net/claude-code-timeline?url=htt...
I built that claude-code-timeline application this morning too, and that thing is 2284 lines of code: https://github.com/simonw/tools/commits/main/claude-code-tim... - but that was much more of a vibe-coded thing, I hardly reviewed the code that was written at all and shipped it as soon as it appeared to work correctly. Since it's a standalone HTML file there's not too much that can go wrong if it has bugs in it.
WhyOhWhyQ
Whenever I start reviewing code produced by Claude I find hundreds of ways to improve it.
I don't know if code quality really matters to most people or to the bottom line, but a good software engineer writes better code than Claude. It is a testament to library maintainers that Claude is able to code at all, in my opinion. One reason is that Claude uses APIs in wacky ways. For instance, by reading the SDL2 documentation I was able to find many ways that Claude writes SDL2 using archaic patterns from the SDL days.
I think there are a lot of hidden ways AI booster types benefit from basic software engineering practices that they actively promote damaging ideas about. Maybe it will only be 10 years from now that we learn that having good engineers is actually important.
leothetechguy
I couldn't in good conscience work like that, I believe the risk of bad AI generated code due to the tiniest of output variation is far too high. Especially in systems that need to maintain a large state governed by many rules and edge cases.
WhyOhWhyQ
You're writing Python and JavaScript, right? Those languages are extremely easy to write in (which conversely means the legibility is likely to be poor). People maintaining legacy systems in systems-level languages aren't going to be able to produce as much code as people writing Python and JavaScript.
CrzyLngPwd
1,000 lines of debt that you didn't review and probably have no idea what they do.
noosphr
I'm a good aerospace engineer, my rockets weigh an extra 50kg after every day I work on them.
cmdtab
I saw your example and it was a simple cli tool. Of course you can have claude make commits effectively to it!
simonw
Totally. I have dozens of "simple CLI tools" that I work on - and small plugins, and HTML+JavaScript utilities.
If I was hacking on the Linux kernel I would be delighted with myself for producing 40 lines of landed code in a single day.
eikenberry
They are obviously talking about writing code against expectations greater than these simple tools. Why troll with the hyperbole?
TuringNYC
Kudos to the designer, this site is beautiful.
a1ff00
Was going to comment the same. Love the dot matrix paper look.
dionian
agreed. was it AI ?! not that i care - ive been doing a lot of tailwind apps in ai with great success. AI is great for the web, takes all the tedium out of it
dakshgupta
Hi, I'm Daksh, a co-founder of Greptile. We're an AI code review agent used by 2,000 companies from startups like PostHog, Brex, and Partiful, to F500s and F10s.
About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.
We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.
ChrisbyMe
Hey! Thanks for publishing this.
Would be interested in seeing the breakdown between uplift vs company size.
e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.
dakshgupta
This is a good one, wish we had included it. I'd run some analysis on this a while ago and it was pretty interesting.
An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.
jacekm
> About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.
Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPI.
Regardless of the source, it's still an interesting report, thank you for this!
dakshgupta
Thanks! The first 4 charts as well as Chart 2.3 are all from our data!
neom
If AI tools are making teams 76% faster with 100% more bugs, one would presume you're not more productive, you're just punting more debt. I'm no expert on this stuff, but coupling it with some type of defect density insights might be helpful. I would also be interested to know what percentage of AI-assisted code is "rolled back" or "reverted" within 48 hours. Has there been any change in the number of review iterations over time?
chis
Wish you'd show data from past years too! It's hard to know if these are seasonal trends or random variance without that.
Super interesting report though.
wrs
It’s hard to reach any conclusion from the quantitative code metrics in the first section, because as we all know, more code is not necessarily better. “Quantity” is not actually the same as “velocity”. And that gets to the most important question people have about AI assistance: does it help you maintain a codebase long term, or does it help you fly headlong into a ditch?
So, do you have any quality metrics to go with these?
dakshgupta
We weren’t able to find a good quality measure. LLM-as-judge didn't feel right. You’re correct that without that the data is interesting but not particularly insightful.
vb-8448
In the engineering team velocity section, the most important metric is missing: the change rate of new code, or how many times it is changed before being fully consolidated.
dakshgupta
This is a great suggestion. I'll note it down for next year's. Curious, do you think this would be a good proxy for code quality?
vb-8448
It's tricky, but one can assume that code written once and not touched in a while is good code (didn't cause any issues, performance is good enough, etc.).
I guess you can already derive this value if you sum the total lines changed by all PRs and divide it by (SLOC end - SLOC start). Ideally it must be a value slightly greater than 1.
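The ratio described above could be computed roughly as follows. This is a sketch under stated assumptions: "lines changed" is taken to mean additions plus deletions per PR, and the `PullRequest` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical per-PR diff stats (illustration only)."""
    additions: int
    deletions: int


def churn_ratio(prs: list[PullRequest], sloc_start: int, sloc_end: int) -> float:
    """Total lines changed across PRs divided by net SLOC growth.

    A value slightly above 1 suggests most new code landed once and stayed;
    larger values mean code was rewritten several times along the way.
    """
    net_growth = sloc_end - sloc_start
    if net_growth <= 0:
        # The ratio is undefined (or meaningless) when the codebase did not grow.
        raise ValueError("net SLOC growth must be positive")
    # Assumption: "lines changed" = additions + deletions.
    total_changed = sum(pr.additions + pr.deletions for pr in prs)
    return total_changed / net_growth


# Made-up numbers: 250 lines added, 50 removed, so SLOC grows by 200.
prs = [PullRequest(120, 10), PullRequest(80, 30), PullRequest(50, 10)]
print(churn_ratio(prs, sloc_start=1000, sloc_end=1200))
```

Counting deletions in the numerator is a choice; using additions only would give a lower ratio for the same history.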
all2
I would consider feature complete with robust testing to be a great proxy for code quality. Specifically, that if a chunk of code is feature complete and well tested and now changing slowly, it means -- as far as I can tell -- that the abstractions contained are at least ok at modeling the problem domain.
I would expect code that continually changes and deprecates and creates new features is still looking for a good problem domain fit.
dakshgupta
Most of our customers are enterprises, so I feel relatively comfortable assuming they have some decent testing and QA in place. Perhaps I am too optimistic?
sillyfluke
It depends on how well you vetted your samples.
fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.
dakshgupta
Apologies, that is poor wording on our part. It's internal data from engineers that use Greptile, which are tens of thousands of people from a variety of industries. As opposed to external, public data, which is where some of the charts are from.
superchris
This thing that can't be measured is up 76%. Eyeroll
nekooooo
i'm a designer and even i know not to measure 'lines of code' as meaningful output or impact. are we really doing this?
dakshgupta
We expressly did not conclude that more lines = better. You could easily argue more lines = worse. All we wanted to show is that there are more lines.
poliphili
Language like "productivity gains", "output" and "force multiplier" isn't neutral like you're claiming here, and does imply that the line count metric indicates value being delivered for the business.
nik0xffff
[flagged]
psunavy03
Sigh . . . once again I see "velocity" as something to be increased.
This makes me metaphorically stabby.
dakshgupta
We were trying not to insinuate that, because we don’t have a good way to measure quality, without which velocity is useless.
rnewme
[flagged]
I take these "code-output" metrics with a pinch of salt. Of course, a machine can generate 1000 times more lines of code, much as a power loom does. However, the comparison with the power loom ends there.
How maintainable is this code output? I saw a SPA HTML file produced by a model that looked almost like assembly code. So if the code can only be maintained by a model, then an appropriate metric should be based on the long-term maintainability achieved, not on the instant generation of code.