Learnings from building AI agents

48 comments

June 26, 2025

Oras

The problem is that, regardless of how you try to use "micro-agents" as a marketing term, LLMs are instructed to return a result.

They will always try to come up with something.

The example provided was a poor one. The comment from the LLM was solid. Why would you comment out a step in the pipeline instead of just deleting it? I would comment the same in a PR.

SparkyMcUnicorn

I've found that giving agents an "opt out" works pretty well.

For structured outputs, making fields optional isn't usually enough. Providing an additional field for it to dump some output, along with a description for how/when it should be used, covers several issues around this problem.

I'm not claiming this would solve the specific issues discussed in the post. Just a potentially helpful tip for others out there.
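A rough sketch of what that opt-out could look like. The schema and the field names (`finding`, `notes`) are made up for illustration and do not come from any particular tool; the point is only that the model gets a nullable result plus a described place to put everything else.

```python
import json
from typing import Optional

# Hypothetical schema: `finding` may be null, and `notes` is the "dump" field
# whose description tells the model how and when to use it.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "finding": {
            "type": ["string", "null"],
            "description": "The issue found, or null if nothing is worth reporting.",
        },
        "notes": {
            "type": "string",
            "description": (
                "Free-form scratch space. Use it for doubts, missing context, "
                "or anything that does not belong in `finding`."
            ),
        },
    },
    "required": ["finding", "notes"],
}


def parse_review(raw: str) -> Optional[str]:
    """Return the finding to post, or None when the model opted out."""
    data = json.loads(raw)
    finding = data.get("finding")
    if not finding or not finding.strip():
        return None  # opt-out: nothing gets posted, `notes` can still be logged
    return finding.strip()
```

With this shape, a response like `{"finding": null, "notes": "not enough context"}` produces no review comment at all instead of a forced one.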

ffsm8

Likely because it's temporary?

It takes less effort to re-enable if it's just commented out, and it's more visible that there is something funky going on that someone should fix.

But yeah, even if it's temporary, it really should have the rationale for commenting it out added... It takes like 5s and provides important context for reviewers and people looking through the file history in the future.

elzbardico

Funny thing is the structured output in the last example.

```
{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81
}
```

You know the confidence value is completely bogus, don't you?

munificent

Easy fix, just have the LLM generate:

    {
      "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
      "finding": "Possible nil‑pointer dereference",
      "confidence": 0.81,
      "confidence_in_confidence_rating": 0.54,
      "confidence_in_confidence_rating_in_confidence_rating": 0.12,
      "confidence_in_confidence_rating_in_confidence_rating_in_confidence_rating": 0.98,
      // Etc...
    }

zengid

confidence all the way down

ramity

I too once fell into the trap of having an LLM generate a confidence value in a response. This is a very genuine concern to raise.

sharkjacobs

Do you mean that there is no correlation between confidence and false positives or other errors?

ramity

elzbardico is pointing out how the author is having the confidence value generated in the output of the response rather than it being the confidence of the output.
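To illustrate the distinction: a number the model writes into its own JSON is just more generated text. A confidence of the output has to be computed outside the text, for example from token log-probabilities when the API exposes them. A minimal sketch, assuming you already have per-token log-probs as a list of floats (how you obtain them depends on the provider and is not shown here):

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of the token probabilities, i.e. exp(mean log-prob).

    This measures how likely the model found its own text, not whether
    the finding is actually correct.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Even this is only a proxy; it would still need to be calibrated against labeled review comments before anyone filters on it.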

singron

I think they skipped over a non-obvious motivating example too fast. On first glance, commenting out your CI test suite would be very bad to sneak into a random PR, and that review note might be justified.

I could imagine the situation might actually be more nuanced (e.g. adding new tests and some of them are commented out), but there isn't enough context to really determine that, and even in that case, it can be worth asking about commented out code in case the author left it that way by accident.

Aren't there plenty of more obvious nitpicks to highlight? A great nitpick example would be one where the model will also ask to reverse the resolution. E.g.

    final var items = List.copyOf(...);
    <-- Consider using an explicit type for the variable.

    final List items = List.copyOf(...);
    <-- Consider using var to avoid redundant type name.

This is clearly aggravating since it will always make review comments.

willsmith72

yep completely agreed, how can that be the best example they chose to use?

If I reviewed that PR, absolutely I'd question why you're commenting that out. There better be a very good reason, or even a link to a ticket with a clear deadline of when it can be cleaned up/reverted

jstummbillig

The multi-agent thing with different roles is so obviously not a great concept that I am very hesitant to build towards it, even though it seems to win out right now. We want an AI that internally does what it needs to do to solve a problem, given a good enough problem description, tools and context. I really do not want to have to worry about breaking up tasks into chunks that are smaller than what I could handle myself, and I really hope that in the near future this will go away.

brabel

People creating products need to do what gives results right now. And I can attest that breaking up jobs into small steps seems to work better for most scenarios. When that becomes unnecessary, creating products that are useful will become much easier for sure, but I wouldn’t hold my breath.

nico

> 2.3 Specialized Micro-Agents Over Generalized Rules Initially, our instinct was to continuously add more rules into a single large prompt to handle edge cases

This has been my experience as well. However, it seems like platforms like Cursor/Lovable/v0/et al are doing things differently

For example, this is Lovable’s leaked system prompt, 1550 lines: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...

Is there a trick to making gigantic system prompts work well?

h1fra

what I saw using 5-6 tools like this:

- PR descriptions are never useful; they barely summarize the file changes

- 90% of comments are wrong or irrelevant, whether because it's missing context, missing tribal knowledge, missing code quality rules, or wrongly interpreting the code change

- 5-10% of the time it actually spots something

Not entirely sure it's worth the noise

asdev

the code reviews can't be effective because the LLM does not have the tribal knowledge and product context of the change. it's just reading the code at face value

bwfan123

code-reviews are not a good use case for LLMs. here's why: LLMs shine in use cases where their output is not evaluated on accuracy - for example, recommendations, semantic search, sample snippets, images of people riding horses etc. code-reviews require accuracy.

What is a useful agent in the context of code-reviews in a large codebase is a semantic search agent which adds a comment containing related issues or PRs from the past for more context to human reviewers. This is a recommendation and is not rated on accuracy.
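A toy sketch of that kind of recommender. Real semantic search would use embeddings; the bag-of-words cosine similarity here is only a stand-in so the example runs with the standard library, and the PR ids and texts in the usage note are invented.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related(query: str, past: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the ids of the `top_k` past items most similar to `query`."""
    q = Counter(query.lower().split())
    ranked = sorted(
        past,
        key=lambda pr_id: cosine(q, Counter(past[pr_id].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]
```

Calling `related("nil pointer dereference in config", {"PR-1": "fix nil pointer in config loader", "PR-2": "update readme badges"}, top_k=1)` ranks `PR-1` first. The output is a list of pointers for the human reviewer, not a verdict, which is why a miss costs little.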

nzach

I agree with the sentiment of this post. In my personal experience, the usefulness of an LLM is positively correlated with your ability to constrain the problem it should solve.

Prompts like 'Update this regex to match this new pattern' generally give better results than 'Fix this routing error in my server'.

Although this pattern seems true empirically, I've never seen any hard data to confirm this property(?). And this post is interesting but seems like a missed opportunity to back this idea with some numbers.

exitb

This seems like really bad news for the „AI will soon replace all software developers” crowd.

kurtis_reed

There was a blog post from another AI code review tool: "How to Make LLMs Shut Up"

https://news.ycombinator.com/item?id=42451968

shenberg

When I read "51% fewer false positives" followed immediately by "Median comments per pull request cut by half" it makes me wonder how many true positives they find. That's maybe unfair as my reference is automated tooling in the security world, where the true-positive/false-positive ratio is so bad that a 50% reduction in false positives is a drop in the bucket
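To put made-up numbers on that worry: none of the figures below come from the post except the two quoted percentages, and treating "median comments per pull request" as if it were the total comment count is a loud simplification.

```python
def remaining_true_positives(
    true_pos: float,
    false_pos: float,
    fp_reduction: float = 0.51,       # "51% fewer false positives"
    comment_reduction: float = 0.5,   # "comments ... cut by half" (assumed to apply to totals)
) -> float:
    """True positives left after both reductions, given hypothetical baseline counts."""
    total_after = (true_pos + false_pos) * (1 - comment_reduction)
    false_pos_after = false_pos * (1 - fp_reduction)
    return total_after - false_pos_after
```

With a hypothetical baseline of 10 useful and 90 useless comments, `remaining_true_positives(10, 90)` comes out at about 5.9. Under these assumptions the useful comments are roughly halved too, so the signal-to-noise ratio barely moves.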

jangletown

"51% fewer false positives", how were you measuring? is this an internal or benchmarking dataset?

b0a04gl

we tried something simple. surprisingly exposed a lot; just ran the same input twice through the agent, temp 0. diffed the reasoning trace token by token, didn't expect much honestly. but even small shifts showed up. one run said 'this may introduce risk'. other said 'this could cause issues'.. exact same code. made us realise the prompt wasn't grounding the rationale path tight enough. wasn't hallucinating. just the why kept wobbling
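A sketch of that kind of check using only the standard library. Whitespace-separated words stand in for model tokens here, and how the two traces are produced (the agent call itself) is left out.

```python
import difflib


def trace_diff(run_a: str, run_b: str) -> list[tuple[str, str, str]]:
    """Return (op, text_in_a, text_in_b) for every span where the traces differ."""
    a, b = run_a.split(), run_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # keep only replaced, inserted, or deleted spans
    ]
```

For the two phrasings quoted above, `trace_diff("this may introduce risk", "this could cause issues")` returns `[("replace", "may introduce risk", "could cause issues")]`, and identical traces return an empty list.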

mattas

"After extensive trial-and-error..."

IMO, this is the difference between building deterministic software and non-deterministic software (like an AI agent). It often boils down to randomly making tweaks and evaluating the outcome of those tweaks.

s1mplicissimus

Afaik alchemists had a more reliable method than ... whatever this state of affairs is ^^

snapcaster

You're saying alchemy is better than the scientific method?

AndrewKemendo

Otherwise known as science

1:Observation 2:Hypothesis 3:test 4:GOTO:1

This is everything ever built ever

What is the problem exactly?

wrs

For one thing, what you learned can stop working when you switch to a new model, or just a newer version of the “same” model.
