The AI Code Review Disconnect: Why Your Tools Aren't Solving Your Real Problem

Legend2440

Man, I hate when I'm reading a blogpost and then I realize the whole thing is just an ad for a startup.

tomrod

One thing I have liked on one of the small business subreddits is that they require "I will not promote" in submissions and enforce it as a rule.

Stealth ads are annoying.

avikalp

I'm sorry, I didn't mean it to be an ad. I have been interviewing engineering leaders for months, and my startup idea was born out of those conversations. I don't have the product ready yet - it is evolving based on what I am learning.

I just thought it would be a good idea to share what I have learnt.

loxs

That's why I first read the comments to decide if it's worth my time.

Boldened15

I'm curious, why? The startup connection is disclosed in the third sentence, so it's not particularly hidden. And at least if it's a real product they're putting their money where their mouth is; you can check out their product or reviews from customers to see if their approach is right.

Someone working on AI tooling for code reviews is exactly the right person I'd want to get an opinion from on the space, otherwise it's just opining with no validation.

fourside

Because someone selling AI tooling is incentivized to sell you AI, not to inform you.

Herring

AI is not good enough yet for anything requiring deep reasoning, mission-critical work, error detection at a human-expert level, or handling unpredictable edge cases.

It just talks like it's very smart, and humans apparently have a bias for persuasive communication skills. It's also very fast, which humans also think indicates general intelligence. But it's not, and that's why most LLM tools are author-focused, so that a human expert can catch errors.

The way you know fully autonomous driving is nowhere near ready is by noticing we don't even trust robots to do fully autonomous cooking and cleaning. Similarly, let's see it understand and refactor a massive codebase first.

avikalp

I have had a similar discussion with a fellow On-Deck Founder, and here is where we landed:

- More than being "good enough", it is about taking responsibility.

- A human can make more mistakes than an AI and still be the more appropriate choice, because humans can be held responsible for their actions. AI, by its very nature, cannot be 'held responsible' -- this has been agreed upon based on years of research in the field of "Responsible AI".

- To completely automate anything using AI, you need a way to trivially verify whether it did the right thing or not. If the output cannot be verified trivially, you are just changing the nature of the job, and it is still a job for a human being (like the staff you mentioned who remotely control Waymos when something goes wrong).

- If an action is not trivially verifiable and requires the AI's output to reach the end user directly, without a human in the loop, then the creator is taking a massive risk. That usually doesn't make sense for a business when it comes to mission-critical activities.

In Waymo's case, they are taking massive risks because of Google's backing. But it is not about being 'good enough'. It is about the results of the AI being trivially verifiable - which, in the case of driving, is true. You just need a few yes/no answers: Did the customer reach where they wanted to go? Are they safe? Did they arrive on time? Are they happy with the experience?

mrshadowgoose

> The way you know fully autonomous driving is nowhere near ready

How do you reconcile this claim with Waymo's dramatically increased rate of expansion these past few years?

Herring

Billions of dollars from Google, basically.

https://www.businessinsider.com/robotaxis-may-mobility-tesla...

High operational costs, low revenue potential, technical difficulties, competitors exiting the space.

mrshadowgoose

Sorry, that's goalpost moving.

Just reminding you of your earlier claim:

> AI is not good enough yet for anything requiring deep reasoning, mission-critical work...

Is driving a mission-critical function? Due to its safety critical nature, many would say "yes".

So have you simply pivoted to "oh it does work, but it's not as profitable as it should be"?

llm_trw

>AI is not good enough yet for anything requiring deep reasoning, mission-critical work, error detection at a human-expert level, or handling unpredictable edge cases.

AI is better than humans at all those things. It's just not good at them when the context it needs to look over is more than a few thousand tokens.

Rejoice programmer, for your inability to write modular code saved your job.

azthecx

Apt username for such a bonkers response

shermantanktop

I thought my 3000 line kitchen sink function which mutates globals, uses n+1 fetching, and supports 50 feature flags was a bad idea…maybe not?

lukaslalinsky

Is the purpose of these tools really to spend less time? I think their main value is reducing mistakes by having one extra set of eyes, even mechanical ones, looking at the code.

As the sole developer of a non-trivial open source project, I recently started using CodeRabbit, very skeptical about it. But right on the first PR it found a bug that my CI tests did not catch, so I decided to keep it after that.

Gemini Code Assist, on the other hand, was out immediately: the first suggestion it made would actually have introduced a bug.

avikalp

What you are saying is true, and this is the feedback I hear every time I talk to a small team of developers (generally fewer than 15 developers).

At this stage, you don't need "another set of eyes" because it is not that big of a problem to break something, as you are not going to lose massive amounts of money because of the mistake.

All these teams need is a sanity check. They also generally (even without the AI code reviewers) do not have a strong code review process.

This is why, in the article, I have clearly mentioned that these are learnings based on talking to engineers at Series B and Series C startups.

layer8

These code review tools are basically analogous in function to a linter. They flag potential issues, but you still have to check all of them.

CompoundEyes

I put in a code reviewer that runs and comments when a pull request is created, using GitHub Actions and Microsoft GenAIScript. It's pretty straightforward. The key thing is we have total control over the prompt to fit our repo and devs' needs, can make it multi-stage and deterministic using TypeScript code, or use agents in GenAIScript to open adjacent files for more context.

The value we've received is that a dev can look over the review to catch anything they might have missed and make changes, all before another dev looks at it. That saves time. I've seen devs open draft pull requests to get preliminary feedback on work in progress. The reviewer script is versioned with the repo. Currently we use a mix of gpt-4o and gpt-4o-mini in parts of the script to do smaller tasks.
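For readers who haven't seen GenAIScript, here is a minimal sketch of what a review script in this style can look like. It is an illustration, not the commenter's actual setup: the prompt wording, the model choice, and the assumption that the GitHub Action hands the PR's changed files to the script (exposed as env.files) are all assumptions.

    // Sketch of a GenAIScript-style PR reviewer (illustrative only).
    // Assumes the GitHub Action invokes the script with the PR's changed files,
    // which GenAIScript exposes to the prompt as env.files.
    script({
      title: "pr-review",
      model: "openai:gpt-4o", // larger model for the main review pass
    });

    // Make the changed files available to the prompt under a named block.
    def("CHANGED_FILES", env.files);

    // The prompt lives in the repo, so the team can version and tune it.
    $`You are reviewing a pull request for this repository.
    Check CHANGED_FILES for logic errors, missed edge cases, and violations of our
    repo conventions. Return findings as a JSON array of
    { file, line, severity, comment } objects.`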

HoyaSaxa

I’d be interested in seeing the scripts if you are able to share (redacted) versions of them

donmcronald

How do you make it deterministic?

CompoundEyes

Sorry, I meant that it's JavaScript/TypeScript, so we can deterministically orchestrate a series of prompts and shape their output exactly how we'd like. Returning the review as structured output, as a JSON object, is very helpful for this. If the review result seems bungled, we run a judge prompt at the end and tell it to try again ^_^.
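A rough sketch of that validate-and-retry loop, assuming hypothetical prompt helpers and a made-up ReviewComment shape standing in for whatever prompts and model client a team actually uses:

    // Hypothetical sketch: deterministic orchestration around two prompts,
    // with structured JSON output and a judge pass that can trigger a retry.
    interface ReviewComment {
      file: string;
      line: number;
      severity: "nit" | "warning" | "blocker";
      comment: string;
    }

    type PromptFn = (input: string) => Promise<string>;

    async function reviewDiff(
      diff: string,
      runReviewPrompt: PromptFn, // hypothetical: wraps the main review prompt
      runJudgePrompt: PromptFn,  // hypothetical: wraps the "does this review look sane?" prompt
      maxAttempts = 3,
    ): Promise<ReviewComment[]> {
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const raw = await runReviewPrompt(diff);
        let comments: ReviewComment[];
        try {
          // Structured JSON output keeps the rest of the pipeline deterministic.
          comments = JSON.parse(raw);
        } catch {
          continue; // malformed JSON: just try again
        }
        const verdict = await runJudgePrompt(JSON.stringify({ diff, comments }));
        if (verdict.trim().toLowerCase() === "ok") return comments;
      }
      return []; // post nothing rather than a bungled review
    }

The same shell also supports the "run it again with other models later" idea below: swap in a different PromptFn and compare the JSON outputs.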

donmcronald

That makes sense. You could even run against multiple models, or future models, then, right? I can see some value in that because maybe two years from now the models will be able to surface issues that weren't detected originally. I suppose you could run against the whole codebase in the future, but I could also imagine something that could track down where a bug was introduced.

Do you save the reviews or discard them?

mschild

We didn't purchase a tool, but instead built our own.

> most AI code review tools on the market today are fundamentally author-focused, not reviewer-focused.

This pretty much describes our experience. Our engineers create a PR and now wait for the review bot to provide feedback. The author will fix any actual issues the bot brings up, and only then will they publish the PR to the rest of the team.

From our experience there are 4 things that make the bot valuable:

1. Any general logical issues in the code are caught with relative certainty (not evaluating a variable value properly or missing a potential edge case, etc).

2. Some of the comments the bot leaves are about the business logic in the code; having the author provide a clearer explanation in response helps reviewers understand what's going on if it wasn't clear enough from the code itself.

3. We provide a frontend platform to other engineers in the company that our operations teams interact with. The engineers rarely implement more than 1-2 features a year. We gave the bot a list of coding and frontend guidelines that we enforce (capitalisation rules, title formatting, component spacing, etc.) and it will remind reviewers about these requirements (a rough sketch of that kind of guideline prompt follows this list).

4. We told it to randomly change its way of talking, from Yoda to Dr Seuss, and some of the comments, while correct on a technical level, are absolutely hilarious and can give you a short giggle in an otherwise stressful day.
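As an illustration of point 3, baking the guideline list into the reviewer prompt can be as simple as the sketch below. The guideline texts and names here are made up for the example, not the team's actual rules.

    // Illustrative only: a team guideline list injected into the reviewer's system prompt.
    const GUIDELINES: string[] = [
      "Page and dialog titles use sentence case.",            // capitalisation / title formatting
      "Form labels follow the design-system capitalisation.",
      "Components use the standard 8px spacing scale.",       // component spacing
    ];

    const systemPrompt: string = [
      "You review frontend pull requests for this platform.",
      "Flag any change that violates these team guidelines:",
      ...GUIDELINES.map((g, i) => `${i + 1}. ${g}`),
      "Report each violation with the file, the line, and the guideline number.",
    ].join("\n");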

savanaly

The thing is that inserting AI on the code reviewer side doesn't make too much sense. Unless they have a different AI doing the reviewing than the one that helped write it, there won't be anything left to say at that stage. The AI was already involved in writing it, and as they mention in the article, there are points in the writing-with-AI process where the AI editor will try to catch bugs, educate the developer, and so forth. If the reviewing AI can catch further bugs, that's just proof the writing AI needs to be tightened up, not that there's a role for a reviewing AI.

The commentary above doesn't hold if, due to the preferences of the human developers or just a quirk of their working relationship, they end up with different AIs in the two roles. But I think in the long-term equilibrium this point applies.

avikalp

I agree with you 100%.

In the maker-checker process, if we are imagining a future where AI will be writing and editing most of the code, the AI code review tools will need to integrate into that agentic process.

And the job of a better code-review interface (like the one that I am trying to build) would be to provide a higher level of abstraction to the user so that they can verify the output of the AI code generators more effectively.

shermantanktop

An AI tool that told the author how to create a CR that was readable and changed the minimum amount of stuff in one go would actually be helpful. Multipage CRs are only ok if it’s a bulk reformat or file move operation.

avikalp

This is only true for "development branch" CRs/pull requests. The whole is greater than the sum of the parts. Every small change in the feature that you are building might make complete sense, so every dev-to-feature branch pull request would get approved easily.

But if you are not also reviewing the feature-to-main branch pull request, you are just inviting problems. That is a bigger CR that you should review carefully, and there is no way it could be a small one.