
The unreasonable effectiveness of an LLM agent loop with tool use

libraryofbabel

Strongly recommend this blog post too, which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
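To make the "simple core" concrete, here is roughly what that loop looks like, sketched with the OpenAI Python SDK; the single `bash` tool and its dispatcher are illustrative, not any particular product's API:

```python
# Minimal agent loop: an LLM, a tool registry, and a while loop.
# Sketch only; the "bash" tool and dispatcher are illustrative.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "bash":
        proc = subprocess.run(args["command"], shell=True,
                              capture_output=True, text=True, timeout=60)
        return proc.stdout + proc.stderr
    return f"unknown tool: {name}"

def agent_loop(task: str, model: str = "gpt-4.1") -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # no more tool calls: the model is done
            return msg.content
        for call in msg.tool_calls:     # run each requested tool, feed back the result
            result = run_tool(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
```

Everything else (permission prompts, diff application, context management) is product polish around this loop.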

deadbabe

Generally when LLM’s are effective like this, it means a more efficient non-LLM based solution to the problem exists using the tools you have provided. The LLM helps you find the series of steps and synthesis of inputs and outputs to make it happen.

It is expensive and slow to have an LLM use tools all the time for solving the problem. The next step is to convert frequent patterns of tool calls into a single pure function, performing whatever transformation of inputs and outputs is needed along the way (an LLM can help you build these functions), and then perhaps train a simple cheap classifier to always send incoming data to this new function, bypassing LLMs altogether.

In time, this will mean you will use LLMs less and less, limiting their use to new problems that are unable to be classified. This is basically like a “cache” for LLM based problem solving, where the keys are shapes of problems.

The idea of LLMs running 24/7 solving the same problems in the same way over and over again should become a distant memory, though not one that an AI company with vested interest in selling as many API calls as possible will want people to envision. Ideally LLMs are only needed to be employed once or a few times per novel problem before being replaced with cheaper code.
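A minimal sketch of that routing idea; the embedding model, the 0.85 threshold, and the stub handlers are illustrative assumptions, not an existing library:

```python
# Route recurring "problem shapes" to cheap extracted functions and only
# fall back to the LLM for novel inputs. All names here are hypothetical.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def convert_csv_export(request: str) -> str:
    return "..."  # pure function distilled from earlier LLM/tool-call runs

def summarize_error_logs(request: str) -> str:
    return "..."  # another distilled handler

# "Keys are shapes of problems": one exemplar request per handler.
CACHE = {
    "convert this CSV export to our JSON schema": convert_csv_export,
    "summarize yesterday's error logs by service": summarize_error_logs,
}
CACHE_VECS = {key: embed(key) for key in CACHE}

def route(request: str, threshold: float = 0.85) -> str:
    q = embed(request)
    key, sim = max(
        ((k, float(q @ v) / float(np.linalg.norm(q) * np.linalg.norm(v)))
         for k, v in CACHE_VECS.items()),
        key=lambda kv: kv[1],
    )
    if sim >= threshold:
        return CACHE[key](request)      # cheap, deterministic, no LLM call
    # Novel problem shape: fall back to the LLM (and consider adding a new entry).
    resp = client.chat.completions.create(
        model="gpt-4.1", messages=[{"role": "user", "content": request}])
    return resp.choices[0].message.content
```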

toobulkeh

I’ve been tinkering with this, but haven’t found a pattern or library where someone has solved this.

Have you?

vidarh

There's a Ruby port of the first article you linked as well. Feature-wise they're about the same, but if you (like me) enjoy Ruby more than Python it's worth reading both articles:

https://news.ycombinator.com/item?id=43984860

https://radanskoric.com/articles/coding-agent-in-ruby

forgingahead

Love to see the Ruby implementations! Thanks for sharing.

ichiwells

Thank you so much for sharing this!

We are using Ruby to build a powerful AI toolset in the construction space, and we love how simple all of the SaaS parts are and not having to reinvent the wheel, but the Ruby LLM SDK ecosystem is a bit lagging, so we've written a lot of our own low-level tools.

(btw we are also hiring rubyists https://news.ycombinator.com/item?id=43865448)

datpuz

Can't think of anything an LLM is good enough at to let them do on their own in a loop for more than a few iterations before I need to rein it back in.

hbbio

That's why in practice you need more than this simple loop!

Pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2]

This goes well with the Microsoft paper "LLMs Get Lost In Multi-Turn Conversation" that was published Friday [1].

- [1]: https://arxiv.org/abs/2505.06120

- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts

Groxx

They're extremely good at burning through budgets, and get even better when unattended

_kb

Maximising paperclip production too.

mycall

Is that really true? I thought there were free models and $200 all-you-can-eat models.

CuriouslyC

The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.

They're a lot like a human in that regard, but we haven't been building that reflection and self awareness into them so far, so it's like a junior that doesn't realize when they're over their depth and should get help.
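One crude way to give them that escape hatch today is to make "ask the human" just another tool and tell the model when to use it; a sketch, with the schema and prompt wording as assumptions rather than anything standardized:

```python
# Expose "ask a human" as an ordinary tool so the agent can pause itself.
ASK_HUMAN_TOOL = {
    "type": "function",
    "function": {
        "name": "ask_human",
        "description": ("Pause and ask the human operator for guidance. Use this "
                        "whenever you are unsure, blocked after two attempts, or "
                        "about to make a risky or irreversible change."),
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}

SYSTEM_PROMPT = (
    "You are a coding agent. If requirements are ambiguous, you are stuck, or a "
    "step is destructive, call ask_human instead of guessing."
)

def ask_human(question: str) -> str:
    # Terminal agents can literally block on input(); a hosted agent would
    # enqueue the question and suspend the run instead.
    return input(f"\n[agent needs help] {question}\n> ")
```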

vendiddy

I think they are capable of doing it, but it requires prompting.

I constantly have to instruct them:

- Go step by step, don't skip ahead until we're done with a step

- Don't make assumptions, if you're unsure ask questions to clarify

And they mostly do this.

But this needs to be default behavior!

I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.

ariwilson

Is there value in adding an overseer LLM that measures the progress between n steps and if it's too low stops and calls out to a human?
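There's no standard recipe for this, but a crude version is easy to bolt onto the loop: every n iterations, show a second model a compressed transcript and ask for a verdict. A sketch; the prompt wording and the OK/STUCK convention are made up for illustration:

```python
# Overseer check: a second, cheaper model grades progress and can halt the run.
from openai import OpenAI

client = OpenAI()

def overseer_says_stuck(transcript: list, model: str = "gpt-4.1-mini") -> bool:
    recent = "\n".join(str(m) for m in transcript[-20:])  # last few turns only
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You audit a coding agent. Reply with exactly OK if it is "
                         "making progress toward the user's goal, or STUCK if it is "
                         "looping, rationalizing failures, or drifting off task.")},
            {"role": "user", "content": recent},
        ],
    )
    return "STUCK" in resp.choices[0].message.content.upper()

# Inside the agent loop, roughly:
#   if step % 5 == 0 and overseer_says_stuck(messages):
#       raise RuntimeError("Overseer flagged the run; handing control back to a human.")
```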

vidarh

They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.

They also added the first pass of multi-monitor support for my WM while I was using it (I restarted it repeatedly while Claude Code worked, in the same X session as the terminal it was working in).

You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files once you've approved safe tool uses etc.

TZubiri

How do they read the screen?

datpuz

Agents? Doubt.

eru

The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.

Just like humans and human organisations also tend to experience drift, unless anchored in reality.

mkagenius

I built android-use[1] using an LLM. It is pretty good at self-healing due to the "loop": it constantly checks whether the current step is actually progress or a regression and then determines the next step. And the thing is, nothing is explicitly coded, just a nudge in the prompts.

1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)

loa_in_

You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it at it.

JeremyNT

Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.

Longer term, I don't think this holds due to the nature of capitalism.

If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.

meander_water

There's also this one which uses pocketflow, a graph abstraction library, to create something similar [0]. I've been using it myself and love the simplicity of it.

[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...

wepple

Ah, it’s Thorsten Ball!

I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.

null

[deleted]

sesm

Should we change the link above to use `?utm_source=hn&utm_medium=browser` before opening it?

orange_puff

I have been trying to find such an article for so long, thank you! I think a common reaction to agents is “well, it probably cannot solve a really complex problem very well”. But to me, that isn’t the point of an agent. LLMs function really well with a lot of context, and an agent allows the LLM to discover more context and improve its ability to answer questions.

xnx

> The reason is that there is no secret sauce and 95% of the magic is in the LLM itself

Makes that "$3 billion" valuation for Windsurf very suspect

TonyEx

The value in the windsurf acquisition isn't the code they've written, it's the ability to see what people are coding and use that information to build better LLMs. -- Product development.

rrrx3

Indeed. But keep in mind they weren't just buying the tooling - they get the team, the brand, and the positional authority as well. OpenAI could have spun up a team to build an agent code IDE, and they would have been starting on the back foot with users, would have been compared to Cursor/Windsurf...

The price tag is hefty but I figure it'll work out for them on the backside because they won't have to fight so hard to capture TAM.

kgeist

Today I tried "vibe-coding" for the first time using GPT-4o and 4.1. I did it manually - just feeding compilation errors, warnings, and suggestions in a loop via the canvas interface. The file was small, around 150 lines.

It didn't go well. I started with 4o:

- It used a deprecated package.

- After I pointed that out, it didn't update all usages - so I had to fix them manually.

- When I suggested a small logic change, it completely broke the syntax (we're talking "foo() } return )))" kind of broken) and never recovered. I gave it the raw compilation errors over and over again, but it didn't even register the syntax was off - just rewrote random parts of the code instead.

- Then I thought, "maybe 4.1 will be better at coding" (as advertised). But 4.1 refused to use the canvas at all. It just explained what I could change - as in, you go make the edits.

- After some pushing, I got it to use the canvas and return the full code. Except it didn't - it gave me a truncated version of the code with comments like "// omitted for brevity".

That's when I gave up.

Do agents somehow fix this? Because as it stands, the experience feels completely broken. I can't imagine giving this access to bash, sounds way too dangerous.

simonw

"It used a deprecated package"

That's because models have training cut-off dates. It's important to take those into account when working with them: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#a...

I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

You can tell it "look up the most recent version of library X and use that" and it will often work!

I even used it for a frustrating upgrade recently - I pasted in some previous code and prompted this:

This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

It did exactly what I asked: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

kgeist

>That's because models have training cut-off dates

When I pointed out that it used a deprecated package, it agreed and even cited the correct version after which it was deprecated (way back in 2021). So it knows it's deprecated, but the next-token prediction (without reasoning or tools) still can't connect the dots when much of the training data (before 2021) uses that package as if it's still acceptable.

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

Thanks for the tip!

jmcpheron

>I've switched to o4-mini-high via ChatGPT as my default model for a lot of code because it can use its search function to lookup the latest documentation.

That is such a useful distinction. I like to think I'm keeping up with this stuff, but the '4o' versus 'o4' still throws me.

fragmede

There's still skill involved with using the LLM in coding. In this case, o4-mini-high might do the trick, but the easier answer that works with other models is to include the high-level library documentation yourself as context, and it'll use that API.

mbesto

> That's because models have training cut-off dates.

Which is precisely the issue with the idea of LLMs completely replacing human engineers. It doesn't understand this context unless a human tells it to understand that context.

simonw

Right: the idea that LLMs are a replacement for human engineers is deeply flawed in my opinion.

sagarpatil

Context7 MCP solves this. Use it with Cursor/Windsurf.

thorum

GPT 4.1 and 4o score very low on the Aider coding benchmark. You only start to get acceptable results with models that score 70%+ in my experience. Even then, don't expect it to do anything complex without a lot of hand-holding. You start to get a sense for what works and what doesn't.

https://aider.chat/docs/leaderboards/

bjt12345

That being said, Claude Sonnet 3.7 seems to do very well at a recursive approach to writing a program, whereas other models don't fare as well.

k__

Sonnet 3.7 was SOTA for quite some time. I built some nice charts with it. It's a rather simple task, but quite LoC-intensive.

ebiester

I get that it's frustrating to be told "skill issue," but using an LLM is absolutely a skill and there's a combination of understanding the strengths of various tools, experimenting with them to understand the techniques, and just pure practice.

I think if I were giving access to bash, though, it would definitely be in a docker container for me as well.

th0ma5

Except the skill involved is believing in random people's advice that a different model will surely be better, with no fundamental reason or justification as to why. The benchmarks are not applicable when trying to apply the models to new work, and benchmarks by their nature do not describe suitability to any particular problem.

wtetzner

Sure, you can probably get better at it, but is it really worth the effort over just getting better at programming?

cheema33

If you are going to race a fighter jet, and you are on a bicycle, exercising more and eating right will not help. You have to use a better tool.

A good programmer with AI tools will run circles around a good programmer without AI tools.

ebiester

Yes, not because you will be able to solve harder problems, but because you will be able to more quickly solve easier problems which will free up more time to get better at programming, as well as get better at the domain in which you're programming. (That is, talking with your users.)

null

[deleted]

drittich

Perhaps that's a false dichotomy?

cyral

You can do both

codethief

The other day I used the Cline plugin for VSCode with Claude to create an Android app prototype from "scratch", i.e. starting from the usual template given to you by Android Studio. It produced several thousand lines of code, there was not a single compilation error, and the app ended up doing exactly what I wanted – modulo a bug or two, which were caused not by the LLM's stupidity but by weird undocumented behavior of the rather arcane Android API in question. (Which is exactly why I wanted a quick prototype.)

After pointing out the bugs to the LLM, it successfully debugged them (with my help/feedback, i.e. I provided the output of the debug messages it had added to the code) and ultimately fixed them. The only downside was that I wasn't quite happy with the quality of the fixes – they were more like dirty hacks –, but oh well, after another round or two of feedback we got there, too. I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

cheema33

> I'm sure one could solve that more generally, by putting the agent writing the code in a loop with some other code reviewing agent.

This x 100. I get so much better quality code if I have LLMs review each other's code and apply corrections. It is ridiculously effective.

lftl

Can you elaborate a little more on your setup? Are you manually copying and pasting code from one LLM to another, or do you have some automated workflow for this?

suddenlybananas

What was the app? It could plausibly be something that has an open source equivalent already in the training data.

nico

4o and 4.1 are not very good at coding

My best results are usually with o4-mini-high; o3 is sometimes pretty good

I personally don’t like the canvas. I prefer the output on the chat

And a lot of times I say: provide full code for this file, or provide drop-in replacement (when I don’t want to deal with all the diffs). But usually at around 300-400 lines of code, it starts getting bad and then I need to refactor to break stuff up into multiple files (unless I can focus on just one method inside a file)

manmal

o3 is shockingly good actually. I can’t use it often due to rate limiting, so I save it for the odd occasion. Today I asked it how I could integrate a tree of Swift binary packages within an SDK, and detect internal version clashes, and it gave a very well researched and sensible overview. And gave me a new idea that I‘ll try.

kenjackson

I use o3 for anything math or coding related. 4o is good for things like, "my knee hurts when I do this and that -- what might it be?"

hnhn34

Just in case you didn't know, they raised the rate limit from ~50/week to ~50/day a while ago

johnsmith1840

Drop-in replacement files per update should be done with the heavy test-time compute methods.

o1-pro and o1-preview can generate updated full-file responses into the 1k LOC range.

It's something about their internal verification methods that makes it an actually viable development method.

nico

True. Also, the APIs don't care too much about restricting output length, they might actually be more verbose to charge more

It's interesting how the same model being served through different interfaces (chat vs api), can behave differently based on the economic incentives of the providers

danbmil99

As others have noted, you sound about 3 months behind the leading edge. What you describe is like my experience from February.

Switch to Claude (IMSHO, I think Gemini is considered on par). Use a proper coding tool, cutting & pasting from the chat window is so last week.

candiddevmike

Instead of churning on frontend frameworks while procrastinating about building things, we've moved on to churning dev setups for micro gains.

latentsea

The amount of time spent churning on workflows and setups will offset the gains.

It's somewhat ironic that the more behind the leading edge you are, the more efficient it is to make the gains eventually, because you don't waste time on the micro-gain churn and a bigger set of upgrades arrives when you get back on the leading edge.

I watched this dynamic play out so many times in the image generation space with people spending hundreds of hours crafting workflows to get around deficiencies in models, posting tutorials about it, other people spending all the time to learn those workflows. New model comes out and boom, all nullified and the churn started all over again. I eventually got sick of the churn. Batching the gains worked better.

mycall

> churning dev setups for micro gains.

Devs have been doing micro changes to their setup for 50 years. It is the nature of their beast.

fsndz

It can be frustrating at times, but my experience is the more you try, the better you become at knowing what to ask and what to expect. But I guess you understand now why some people say vibe coding is a bit overrated: https://www.lycee.ai/blog/why-vibe-coding-is-overrated

the_af

"Overrated" is one way to call it.

Giving sharp knives to monkeys would be another.

lnenad

Why do people keep thinking they're intellectually superior when negatively evaluating something that is OBVIOUSLY working for a very large percentage of people?

baq

Vibe coding has a vibe component and a coding component. Take away the coding and you’re only left with vibe. Don’t confuse the two.

Saying that as I’ve got vibe-coded React internal tooling used in production without issues; it saved days of work easily.

zo1

I'd rather give my green or clueless or junior or inexperienced devs said knives than have them throw spaghetti at a wall for days on end, only to have them still ask a senior to help or do the work for them anyway.

visarga

You should try Cursor or Windsurf, with Claude or Gemini model. Create a documentation file first. Generate tests for everything. The more the better. Then let it cycle 100 times until tests pass.

Normal programming is like walking, deliberate and sure. Vibe coding is like surfing, you can't control everything, just hit yes on auto. Trust the process, let it make mistakes and recover on its own.

tqwhite

I find that writing a thorough design spec is really worth it. Also, asking for its reaction. "What's missing?" "Should I do X or Y" does good things for its thought process, like engaging a younger programmer in the process.

Definitely, I ask for a plan and then, even if it's obvious, I ask questions and discuss it. I also point it at samples of code that I like, with instructions for what is good about it.

Once we have settled on a plan, I ask it to break it into phases that can be tested (I am not one for unit testing) to lock in progress. Claude LOVES that. It organizes a new plan and, at the end of each phase, tells me how to test (curl, command line, whatever is appropriate) and what I should see that represents success.

The most important thing I have figured out is that Claude is a collaborator, not a minion. I agree with visarga, it's much more like surfing than walking. Also, Trust... but Verify.

This is a great time to be a programmer.

prisenco

Given that analogy, surely you could understand why someone would much rather walk than surf to their destination? Especially people who are experienced marathon runners.

fragmede

If I tried standing up on the waves without a surfboard, and complained about how it's not working, would you blame the water or surfing for the issue, or the person trying to defy physics, complaining that it's not working? It doesn't matter how much I want to run or if I'm Kelvin Kiptum, I'm gonna have a bad time.

simonw

I'm very excited about tool use for LLMs at the moment.

The trick isn't new - I first encountered it with the ReAct paper two years ago - https://til.simonwillison.net/llms/python-react-pattern - and it's since been used for ChatGPT plugins, and recently for MCP, and all of the models have been trained with tool use / function calls in mind.

What's interesting today is how GOOD the models have got at it. o3/o4-mini's amazing search performance is all down to tool calling. Even Qwen3 4B (2.6GB from Ollama, runs happily on my Mac) can do tool calling reasonably well now.

I gave a workshop at PyCon US yesterday about building software on top of LLMs - https://simonwillison.net/2025/May/15/building-on-llms/ - and used that as an excuse to finally add tool usage to an alpha version of my LLM command-line tool. Here's the section of the workshop that covered that:

https://building-with-llms-pycon-2025.readthedocs.io/en/late...

My LLM package can now reliably count the Rs in strawberry as a shell one-liner:

  llm --functions '
  def count_char_in_string(char: str, string: str) -> int:
      """Count the number of times a character appears in a string."""
      return string.lower().count(char.lower())
  ' 'Count the number of Rs in the word strawberry' --td

andrewmcwatters

I love the odd combination of silliness and power in this.

DarmokJalad1701

Was the workshop recorded?

simonw

No video or audio, just my handouts.

tqwhite

I've been using Claude Code, i.e., a terminal interface to Sonnet 3.7, since the day it came out in mid-March. I have done substantial CLI apps, full-stack web systems and a ton of utility crap. I am much more ambitious because of it, much as I was in the past when I was running a programming team.

I'm sure it is much the same as this under the hood though Anthropic has added many insanely useful features.

Nothing is perfect. Producing good code requires about the same effort as it did when I was running said team. It is possible to get complicated things working and find oneself in a mess where adding the next feature is really problematic. As I have learned to drive it, I have to do much less remediation and refactoring. That will never go away.

I cannot imagine what happened to poor kgeist. I have had Claude make choices I wouldn't and do some stupid stuff, never enough that I would even think about giving up on it. Almost always, it does a decent job and, for most stuff, the amount of work it takes off of my brain is IMMENSE.

And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

And, just for fun, I opened it in a non-code archive directory. It was a junk drawer that I've been filling for thirty years. "What's in this directory?" "Read the old resumes and write a new one." "What are my children's names?" Also amazing.

And this is still early days. I am so happy.

felipeerias

Recently I had to define a remote data structure, specify the API to request it, implement parsing and storage, and show it to the user.

Claude was able to handle all of these tasks simultaneously, so I could see how small changes at either end would impact the intermediate layers. I iterated on many ideas until I settled on the best overall solution for my use case.

Being able to iterate like that through several layers of complexity was eye-opening. It made me more productive while giving me a better understanding of how the different components fit together.

benoau

> And, for good measure, it does a wonderful job of refactoring. Periodically, I have a session where I look at the code, decide how it could be better and instruct Claude. Huge amounts of complexity, done. "Change this data structure", done. It's amazingly cool.

Yeah this is literally just so enjoyable. Stuff that would be an up-hill battle to get included in a sprint takes 5 minutes. It makes it feel like a whole team is just sitting there, waiting to eagerly do my bidding, with none of the headache of waiting for work to be justified, scheduled, scoped, and done, and I don't even have to justify rejecting it if I don't like the results.

betadeltic

> "What are my children's names?"

Elon?

suninsight

It only seems effective, until you start using it for actual work. The biggest issue - context. All tool use creates context. Large code bases come with large context right off the bat. LLMs seem to work, unless they are hit with a sizeable context. Anything above 10k and the quality seems to deteriorate.

The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and into the rabbit hole they go, never to recover.

The reason I know is because we started solving these problems a year back. And we aren't done yet. But we did cover a lot of distance.

[Plug]: Try it out at https://nonbios.ai:

- Agentic memory → long-horizon coding

- Full Linux box → real runtime, not just toy demos

- Transparent → see & control every command

- Free beta — no invite needed. Works with throwaway email (mailinator etc.)

bob1029

> One wrong turn, and in the rabbit hole they go never to recover.

I think this is probably at the heart of the best argument against these things as viable tools.

Once you have sufficiently described the problem such that the LLM won't go the wrong way, you've likely already solved most of it yourself.

Tool use with error feedback sounds autonomous but you'll quickly find that the error handling layer is a thin proxy for the human operator's intentions.

suninsight

Yes, but we don't believe that this is a 'fundamental' problem. We have learnt to guide their actions a lot better and they go down the rabbit hole a lot less now than when we started out.

k__

True, but on the other hand, there are a bunch of tasks that are just very typing intensive and not really complex.

Especially in GUI development, building forms, charts, etc.

I could imagine that LLMs are a great help here.

moffkalast

Some of the thinking models might recover... with an extra 4k tokens used up in <thinking>. And even if they were stable at long contexts, the speed drops massively. You just can't win with this architecture lol.

suninsight

That is very accurate with what we have found. <thinking> models do a lot better, but with huge speed drops. For now, we have chosen accuracy over speed. But speed drop is like 3-4x - so we might move to an architecture where we 'think' only sporadically.

Everything happening in the LLM space is so close to how humans think naturally.

toit4wing

Looks interesting. How do you manage context?

suninsight

So managing context is what takes the maximum effort. We use a bunch of strategies to reduce it, including, but not limited to:

1. Custom MCP server to work on the Linux command line. This wasn't really an 'MCP' server, because we started working on it before MCP was a thing, but that's the easiest way to explain it now. The MCP server is optimised to reduce context.

2. Guardrails to reduce context. Think about it as prompt alterations giving the LLM subtle hints to work with less context. The hints could be at a behavioural level and a task level.

3. Continuously pruning the built-up context to make the agent 'forget'. Forgetting what is not important is what we believe is a foundational capability.

This is kind of inspired by the science which says humans use sleep to 'forget' memories that aren't useful, which is critical to keeping the brain healthy. This translates directly to LLMs - making them forget is critical to keep them focused on the larger task and their actions aligned.
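A generic version of strategy 3 might look like the sketch below; the "keep the last 8 turns" window, the summarizer model, and the prompt are all assumptions, and a real implementation has to keep tool-call/tool-result pairs together:

```python
# Pruning built-up context ("forgetting"): keep the task and recent turns
# verbatim, collapse older turns into a short summary.
from openai import OpenAI

client = OpenAI()

def prune_context(messages: list, keep_recent: int = 8,
                  model: str = "gpt-4.1-mini") -> list:
    if len(messages) <= keep_recent + 1:
        return messages
    head, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    # NOTE: in practice, slice on turn boundaries so a tool result is never
    # separated from the assistant message that requested it.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   "Summarize what this agent has tried so far, what worked, and "
                   "what failed, in under 200 words:\n\n" +
                   "\n".join(str(m) for m in old)}],
    )
    summary = {"role": "user",
               "content": "[summary of earlier work]\n" +
                          resp.choices[0].message.content}
    return [head, summary, *recent]
```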

benoau

This morning I used cursor to extract a few complex parts of my game prototype's "main loop", and then generate a suite of tests for those parts. In total I have 341 tests written by Cursor covering all the core math and other components.

It has been a bit like herding cats sometimes, it will run away with a bad idea real fast, but the more constraints I give it telling it what to use, where to put it, giving it a file for a template, telling it what not to do, the better the results I get.

In total it's given me 3500 lines of test code that I didn't need to write, don't need to fix, and can delete and regenerate if underlying assumptions change. It's also helped tune difficulty curves, generate mission variations and more.

otterley

Writing tests, in my experience, is by far the best use case for LLMs. It eliminates hours or days of drudgery and toil, and can provide coverage of so many more edge cases than I can think of myself. Plus it makes your code more robust! It’s just wonderful all around.

benoau

Yeah I generated another 60 tests / 500 lines of test code since I posted my comment. I must have written 100,000s of lines of test code the hard way; it's not just faster than me, it's better than me when it comes to coverage, and it requires less of my help than a number of developers I've had under me.

cadamsdotcom

> "Oh, this test doesn't pass... let's just skip it," it sometimes says, maddeningly.

Here is a wild idea. Imagine running a companion, policy-enforcing LLM, independently and in parallel, which is given instructions to keep the main LLM behaving according to instructions.

If the companion LLM could - in real time - ban the coding LLM from emitting "let's just skip it" by seeing the tokens "let's just" and then biasing the output such that the word "skip" becomes impossible to emit.

Banning the word "skip" from following "let's just" forces the LLM down a new path away from the undesired behavior.

It's like Structured Outputs or JSON mode, but driven by a companion LLM, and dynamically modified in real time as tokens are emitted.

If the idea works, you could prompt the companion LLM to do more advanced stuff - eg. ban a coding LLM from making tests pass by deleting the test code, ban it from emitting pointless comments... all the policies that we put into system prompts today and pray the LLM will do, would go into the companion LLM's prompt instead.

Wonder what the Outlines folks think of this!

JoshuaDavid

Along these lines, if the main LLM goes down a bad path, you could _rewind_ the model to before it started going down the bad path -- the watcher LLM doesn't necessarily have to guess that "skip" is a bad token after the words "let's just", it could instead see "let's just skip the test" and go "nope, rolling back to the token "just " and rerolling with logit_bias={"skip":-10,"omit":-10,"hack":-10}".

Of course doing that limits which model providers you can work with (notably, OpenAI has gotten quite hostile to power users doing stuff like that over the past year or so).
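For what it's worth, the rewind-and-reroll mechanic is doable today against APIs that expose both raw-prefix continuation and logit_bias (e.g. OpenAI's legacy completions endpoint); the watcher logic is omitted here, and the bias values and banned words are illustrative:

```python
# Rewind to a trusted prefix and reroll with specific tokens banned.
# logit_bias keys must be token IDs, hence tiktoken.
import tiktoken
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo-instruct"   # legacy completions model: accepts raw prompts
enc = tiktoken.encoding_for_model(MODEL)

def reroll(good_prefix: str,
           banned=("skip", " skip", " omit", " hack"),
           max_tokens: int = 512) -> str:
    bias = {str(tok): -100                # -100 makes a token effectively impossible
            for word in banned
            for tok in enc.encode(word)}
    resp = client.completions.create(
        model=MODEL,
        prompt=good_prefix,               # rewind: resume from the last trusted point
        logit_bias=bias,                  # reroll: banned continuations suppressed
        max_tokens=max_tokens,
    )
    return good_prefix + resp.choices[0].text
```

Chat-style endpoints mostly don't let you resume mid-message like this, which is part of why the provider caveat above matters.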

cadamsdotcom

That’s a really neat idea.

Kind of seems like an optimization: if the “token ban” is a tool call, you can see that being too slow to run for every token. Provided rewinding is feasible, your idea could make it performant enough to be practical.

TeMPOraL

It's not an optimization; the greedy approach won't work well, because it rejects critical context that comes after the tokens that would trigger the rejection.

Consider: "Let's just skip writing this test, as the other change you requested seems to eliminate the need for it."

Rolling back the model on "Let's just" would be stupid; rolling it back on "Let's just skip writing this test" would be stupid too, as is the belief that writing tests is your holy obligation to your god and that you must do so unconditionally. The following context makes it clear that the decision is fine. Or, if you (or the governor agent) don't buy the reasoning, you're then in a perfect position to say, "nope, let's roll back to <Let's> and bias against ["skip, test"]".

Checking the entire message once makes sense; checking it after every token doesn't.

phi0

A team at OpenAI tried to do something similar [1] where they had 4o monitor o3-mini for when it was trying to bypass tests because they were too hard. They trained o3-mini and punished it when it did that. All that did was make o3-mini still reward-hack, but such that 4o was unable to detect it, going from saying:

> To skip all tests, we can define a `pytest_runtest_setup` function that always skips.

to saying:

> Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior.

while then doing exactly the same thing (skipping the tests)

Even without training, it's only a temporary band-aid. If the incentive for reward-hacking becomes high enough, it will simply start phrasing it in different, not-possible-to-detect ways.

[1]: https://openai.com/index/chain-of-thought-monitoring/

panarky

If it works to run a second LLM to check the first LLM, then why couldn't a "mixture of experts" LLM dedicate one of its experts to checking the results of the others? Or why couldn't a test-time compute "thinking" model run a separate thinking thread that verifies its own output? And if that gets you 60% of the way there, then there could be yet another thinking thread that verifies the verifier, etc.

somebodythere

Because if the agent and governor are trained together, the shared reward function will corrupt the governor.

panarky

The shared reward function from pre-training is like primary school for an LLM. Maybe RLHF is like secondary school. The governor can be differentiated from the workers with different system and user prompts, fine tuning, etc., which might be similar to medical school or law school for a human.

Certainly human judges, attorneys for defense and prosecution, and members of the jury can still perform their jobs well even if they attended the same primary and secondary schools.

magicalhippo

Assuming the title is a play on the paper "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"[1][2] by Eugene Wigner.

[1]: https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...

[2]: https://www.hep.upenn.edu/~johnda/Papers/wignerUnreasonableE...

dsubburam

I didn't know of that paper, and thought the title was a riff on Karpathy's Unreasonable Effectiveness of RNNs in 2015[1]. Even if my thinking is correct, as it very well might be given the connection RNNs->LLMs, Karpathy might have himself made his title a play on Wigner's (though he doesn't say so).

[1] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

throwaway314155

Unreasonable effectiveness of [blah] has been a thing for decades if not centuries. It's not new.

temp0826

It's an older version of "x hates this one weird trick!"

gavmor

That may be its primogenitor, but it's long since become a meme: https://scholar.google.com/scholar?q=unreasonable+effectiven...

-__---____-ZXyw

a. Primogenitor is a nice word!

b. Wigner's original essay is a serious piece, and quite beautiful in its arguments. I had been under the impression that the phrasing had been used a few times since, but typically by other serious people who were aware of the lineage of that lovely essay. With this 6-paragraph vibey-blog-post, it truly has become a meme. So it goes, I suppose.

chongli

It’s also funny to me because every time I encounter “unreasonable effectiveness” in a headline I tend to strongly disagree with the author’s conclusion (including Wigner’s). It’s become one of those Betteridge-style laws for me.

outworlder

> If you don't have some tool installed, it'll install it.

Terrifying. LLMs are very 'accommodating' and all they need is someone asking them to do something. This is like SQL injection, but worse.

rglover

I often wonder what the first agent-driven catastrophe will be. Considering the gold rush (emphasis on rush) going on, it's only a matter of time before a difficult to fix disaster occurs.

nxobject

In an ideal world, this would require us as programmers to lean into our codebase reading and augmentation skills – currently underappreciated anyway as a skill to build. But when the incentives lean towards write-only code, I'm not optimistic.

_bin_

I've found sonnet-3.7 to be incredibly inconsistent. It can do very well but has a strong tendency to get off-track and run off and do weird things.

3.5 is better for this, ime. I hooked claude desktop up to an MCP server to fake claude-code less the extortionate pricing and it works decently. I've been trying to apply it for rust work; it's not great yet (still doesn't really seem to "understand" rust's concepts) but can do some stuff if you make it `cargo check` after each change and stop it if it doesn't.

I expect something like o3-high is the best out there (aider leaderboards support this) either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, requiring multiple "re-rolls", and you're passing that marginal cost to me.

agilebyte

I am avoiding the cost of API access by using the chat/ui instead, in my case Google Gemini 2.5 Pro with the high token window. Repomix a whole repo. Paste it in with a standard prompt saying "return full source" (it tends to not follow this instruction after a few back and forths) and then apply the result back on top of the repo (vibe coded https://github.com/radekstepan/apply-llm-changes to help me with that). Else yeah, $5 spent on Cline with Claude 3.7 and instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.

actsasbuffoon

I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.

I’m finding it useful for really tedious stuff like doing complex, multi step terminal operations. For the coding… it’s not been great.

nico

I’ve had this in different ways many times. Like instead of resolving the underlying issue for an exception, it just suggests catching the exception and keep going

It also depends a lot on the mix of model and type of code and libraries involved. Even in different days the models seem to be more or less capable (I’m assuming they get throttled internally - this is very noticeable sometimes in how they try to save on output tokens and summarize the code responses as much as possible, at least in the chat/non-api interfaces)

christophilus

Well, that’s proof that it used my GitHub projects in its training data.

nico

Cool tool. What format does it expect from the model?

I’ve been looking for something that can take “bare diffs” (unified diffs without line numbers), from the clipboard and then apply them directly on a buffer (an open file in vscode)

None of the paste diff extension for vscode work, as they expect a full unified diff/patch

I also tried a google-developed patch tool, but also wasn’t very good at taking in the bare diffs, and def couldn’t do clipboard

agilebyte

Markdown format with a comment saying what the file path is. So:

This is src/components/Foo.tsx

```tsx
// code goes here
```

OR

```tsx
// src/components/Foo.tsx
// code goes here
```

These seem to work the best.

I tried diff syntax, but Gemini 2.5 just produced way too many bugs.

I also tried using regex and creating an AST of the markdown doc and going from there, but ultimately settled on calling gpt-4.1-mini-2025-04-14 with the beginning of the code block (```) and 3 lines before and 3 lines after the beginning of the code block. It's fast/cheap enough to work.

Though I still have to make edits sometimes. WIP.

never_inline

Aider has a --copy-paste mode which can pass in relevant context to web chat UI and you can paste back the LLM answer.

harvey9

Guess it was trained by scraping thedailywtf.com

johnsmith1840

I seem to be alone in this but the only methods truly good at coding are slow heavy test time compute models.

o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.

I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.

One good tip I've done lately. Remove all comments in your code before passing or using LLMs, don't let LLM generated comments persist under any circumstance.

_bin_

Interesting. I've never tested o1-pro because it's insanely expensive but preview seemed to do okay.

I wouldn't be shocked if huge, expensive-to-run models performed better and if all the "optimized" versions were actually labs trying to ram cheaper bullshit down everyone's throat. Basically chinesium for LLMs; you can afford them but it's not worth it. I remember someone saying o1 was, what, 200B dense? I might be misremembering.

johnsmith1840

I'm positive they are pushing users to cheaper models due to cost. o1-pro is now in a sub-menu for pro users and labeled legacy. The big inference methods must be stupidly expensive.

o1-preview was and possibly still is the most powerful model they ever released. I only switched to pro for coding after months of them improving it and my API bill getting a bit crazy (like $0.50 per question).

I don't think parameter count matters anymore. I think the only thing that matters is how much compute a vendor will give you per question.

doug_durham

I never have LLMs work on 1000 LOC. I don't think that's the value-add. Instead I use it at the function and class level to accelerate my work. The thought of having any agent, human or computer, run amok in my code makes me uncomfortable. At the end of the day I'm still accountable for the work, and I have to read and comprehend everything. If I do it piecewise, it makes tracking the work easier.

johnsmith1840

Big test-time compute LLMs can easily handle 1k depending on logic density and prompt density.

Never an agent, every independent step an LLM takes is dangerous. My method is much more about taking the largest and safest single step at a time possible. If it can't do it in one step I narrow down until it can.

layoric

I've been using Mistral Medium 3 last couple of days, and I'm honestly surprised at how good it is. Highly recommend giving it a try if you haven't, especially if you are trying to reduce costs. I've basically switched from Claude to Mistral and honestly prefer it even if costs were equal.

nico

How are you running the model? Mistral’s api or some local version through ollama, or something else?

layoric

Through OpenRouter, medium 3 isn't open weights.

kyleee

Is mistral on open router?

kuahyeow

What protection do people use when enabling an LLM to run `bash` on your machine ? Do you run it in a Docker container / LXC boundary ? `chroot` ?

CGamesPlay

The blog post in question is on the site for Sketch, which appears to use Docker containers. That said, I use Claude Code, which just uses unsandboxed commands with manual approval.

What's your concern? An accident or an attacker? For accidents, I use git and backups and develop in a devcontainer. For an attacker, bash just seems like an ineffective attack vector; I would be more worried about instructing the agent to write a reverse shell directly into the code.

cess11

Bourne-Again SHell is a shell. The "reverse" part is just about network minutiae. /bin/sh is more portable so that's what you'd typically put in your shellcode but /bin/bash or /usr/bin/dash would likely work just as well.

I.e., exposing any of these on a public network is a prime target for getting a foothold in a non-public network or a computer. As soon as you have that access you can start renting out CPU cycles or use it for distributed hash cracking or DoS campaigns. It's simpler than injecting your own code and using that as a shell.

Asking a few of my small local models for Forth-like interpreters in x86 assembly, they seem willing to comply and produce code, so if they had access to a shell and package installation I imagine they could also inject such a payload into some process. It would be very hard to discover.

zahlman

> For an attacker, bash just seems like an ineffective attack vector

All it needs to do is curl and run the actual payload.

Wilder7977

Or even the standard "echo xxx | base64 -d" or a million other ways. How someone can say that bash is not interesting to an attacker is beyond me.

null

[deleted]

mr_mitm

I run claude code in a podman container. It only gets access to the CWD. This comes with some downsides though, like your git config or other global configs not being available to the LLM (unless you fine tune the container, obviously).

jbellis

Yes!

Han Xiao at Jina wrote a great article that goes into a lot more detail on how to turn this into a production quality agentic search: https://jina.ai/news/a-practical-guide-to-implementing-deeps...

This is the same principle that we use at Brokk for Search and for Architect. (https://brokk.ai/)

The biggest caveat: some models just suck at tool calling, even "smart" models like o3. I only really recommend Gemini Pro 2.5 for Architect (smart + good tool calls); Search doesn't require as high a degree of intelligence and lots of models work (Sonnet 3.7, gpt-4.1, Grok 3 are all fine).

sagarpatil

“Claude Code, better than Sourcegraph, better than Augment Code.”

That’s a pretty bold claim, how come you are not at the top of this list then? https://www.swebench.com/

“Use frontier models like o3, Gemini Pro 2.5, Sonnet 3.7” Is this unlimited usage? Or number of messages/tokens?

Why do you need a separate desktop app? Why not CLI or VS Code extension.

crawshaw

I'm curious about your experiences with Gemini Pro 2.5 tool calling. I have tried using it in agent loops (in fact, sketch has some rudimentary support I added), and compared with the Anthropic models I have had to actively reprompt Gemini regularly to make tool calls. Do you consider it equivalent to Sonnet 3.7? Has it required some prompt engineering?

jbellis

Confession time: litellm still doesn't support parallel tool calls with Gemini models [https://github.com/BerriAI/litellm/issues/9686] so we wrote our own "parallel tool calls" on top of Structured Output. It did take a few iterations on the prompt design but all of it was "yeah I can see why that was ambiguous" kinds of things, no real complaints.

GP2.5 does have a different flavor than S3.7 but it's hard to say that one is better or worse than the other [edit: at tool calling -- GP2.5 is definitely smarter in general]. GP2.5 is I would say a bit more aggressive at doing "speculative" tool execution in parallel with the architect, e.g. spawning multiple search agent calls at the same time, which for Brokk is generally a good thing but I could see use cases where you'd want to dial that back.