Why the push for Agentic when models can barely follow a simple instruction?
128 comments
October 14, 2025 · varun_chopra
isodev
Coding with Claude feels like playing a slot machine. Sometimes you get more or less what you asked for, sometimes nothing like it. I don’t think it’s wise or sane to leave them unattended.
joshvince
> if you’re working on novel code, LLMs are absolutely horrible
This is spot on. Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).
What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs). But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.
ehnto
> at least for now.
I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.
Yes, it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute or its limit of capability.
My thinking here is that we already had the technologies of the LLMs and the compute, but we hadn't yet had the reason and capital to deploy it at this scale.
So the surprising innovation of transformers did not give us the boost in capability by itself; it still needed scale. The marketing that enabled the capital, which enabled that scale, is what caused the insane growth, and capital can't grow forever, it needs returns.
Scale has been exponential, and we are hitting an insane amount of capital deployment for this one technology, which has yet to prove commercially viable at the scale of a paradigm shift.
Are businesses that are not AI-based actually seeing ROI on AI spend? That is really the only question that matters, because if that is false, the money and drive for the technology vanish, and the scale that enables it disappears too.
NitpickLawyer
> I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.
??? It has already become exceptional. In 2.5 years (since chatgpt launched) we went from "oh, look how cute this is, it writes poems and the code almost looks like python" to "hey, this thing basically wrote a full programming language[1] with genz keywords, and it mostly works, still has some bugs".
I think goalpost moving is at play here, and we quickly forget how much difference one year makes (last year you needed tons of glue and handwritten harnesses to do anything - see aider), and today you can give them a spec and get a mostly working project (albeit with some bugs), $50 later.
delusional
> Yes, it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute or its limit of capability.
To comment on this, because it's the most common counterargument: most technology has worked in steps. We take a step forward, then iterate on essentially the same thing. It's very rare that we see order-of-magnitude improvement on the same fundamental "step".
Cars were quite a step forward from donkeys, but modern cars are not that far off from the first ones. Planes were an amazing invention, but the next model of plane is basically the same thing as the first one.
wongarsu
I have had LLMs write entire codebases for me, so it's not like the hype is completely wrong. It's just that this only works if what you want is "boring", limited in scope and on a well-trodden path. You can have an LLM create a CRUD application in one go, or if you want to sort training data for image recognition you can have it generate a one-off image viewer with shortcuts tailored to your needs for this task. Those are powerful things and worthy of some hype. For anything more complex you very quickly run into limits, and the time and effort to do it with an LLM quickly approaches the time and effort required to do it by hand.
physicsguy
They're powerful, but my feeling is that largely you could do this pre-LLM by searching on Stack Overflow or copying and pasting from the browser and adapting those examples, if you knew what you were looking for. Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.
Of course, if you don't know what you are looking for, it can make that process much easier. I think this is why people at the junior end find it is making them (a claimed) 10x more productive. But people who have been around for a long time are more skeptical.
alwahi
I have seen so many people say that, but the app stores/package managers aren't being flooded with thousands of vibe-coded apps; meanwhile Facebook is basically AI slop. Can you share your GitHub, or a gist of some of these "codebases"?
0xAFFFF
> Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).
Which makes sense, considering the absolutely massive amount of tutorials and basic HOWTOs that were present in the training data, as they are the easiest kind of programming content to produce.
vallavaraiyan
What is novel code?
1. LLMs would suck at coming up with new algorithms.
2. I wouldn't let an LLM decide how to structure my code. Interfaces, module boundaries, etc.
Other than that, given the right context (the SDK doc for a particular piece of hardware, for example) and a well-organised codebase explained using CLAUDE.md, they work pretty well at filling out implementations. You just need to resist the temptation to prompt when the actual typing would take seconds.
vidarh
My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is tool weaknesses in terms of managing compaction to avoid running out of context at inopportune moments.
It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.
They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tend to focus on three separate areas: 1) debugging guidance for a given project (for my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes"; it sometimes did that without being told, but not consistently, and with the instructions it usually does), 2) acceptance criteria - this does need reiteration, 3) telling it to run tests frequently, make small, testable changes, and to frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.
My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.
It doesn't always provide perfect code every step. But neither do I. It does however usually move in the right direction every step, and has consistently produced progress over time with far less effort on my behalf.
I wouldn't yet have it start from scratch without at least some scaffolding that is architecturally sound, but it can often do that too, though that needs review before it "locks in" a bad choice.
I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.
nosianu
> My experience is opposite to yours.
But that is exactly the problem, no?
It is like knowing, when you need some prediction (e.g. about market behavior), that somewhere out there is a person who will make the perfect one. However, instead of your problem being to make the prediction, it is now to find and identify that expert. Is the problem you converted yours into any less hard, though?
I too had some great minor successes; the current products are definitely a great step forward. However, every time I start anything more complex, I never know in advance whether I will end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it has definitely fixed the problem), or with something usable.
All those examples such as yours suffer from one big problem: They are selected afterwards.
To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.
Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.
PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are beyond that, willing to give it a try, and would like to have a link to the working and reproducible example. I understand that work can rarely be shared, but then those examples are not very useful any more at this point. What would add real value for readers of these discussions now is when people who say they were successful posted the full, working, reproducible example.
EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.
We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.
fragmede
I had a highly repetitive task (/subagents is great to know about), but I didn't get more advanced than a script that sent "continue\n" into the terminal where CC was running every X minutes. What was frustrating is CC was inconsistent with how long it would run. Needing to compact was a bit of a curveball.
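The nudge script can be as dumb as something like this (a sketch assuming Claude Code runs in a tmux pane; the pane name and interval here are arbitrary):

    import subprocess
    import time

    PANE = "cc"          # hypothetical tmux pane where Claude Code is running
    INTERVAL = 15 * 60   # seconds between nudges; pick your own X

    while True:
        time.sleep(INTERVAL)
        # Type "continue" into the pane and press Enter (sends the newline).
        subprocess.run(["tmux", "send-keys", "-t", PANE, "continue", "Enter"],
                       check=False)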
alwahi
If Claude generates the tests, runs those tests, and applies the fixes without any oversight, it is a very "who watches the watchmen" situation.
ErikBjare
If you generate the tests, run those tests, apply fixes without any oversight, it is the very same situation. In reality, we have PR reviews.
fragmede
Gemini?
loveparade
> Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.
Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books converted to markdown in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's a one-time setup cost.
Telling LLMs which files they should look at was indeed necessary 1-2 years ago in early models, but I have not done that for the last half year or so, and I'm working on codebases with millions of lines of code. I've also never had modern LLMs use nonexistent libraries. Sometimes they try to use outdated libraries, but it fails very quickly once they try to compile and they quickly catch the error and follow up with a web search (I use a custom web search provider) to find the most appropriate library.
I'm convinced that anybody who says that LLMs don't work for them just doesn't have a good mental model of HOW LLMs work, and thus can't use them effectively. Or their experience is just outdated.
That being said, the original issue that they don't always follow instructions from CLAUDE/AGENT.md files is quite true and can be somewhat annoying.
fnord123
> Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books converted to markdown in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's a one-time setup cost.
Which language are you using?
makingstuffs
> and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.
This is my most consistent experience. It is great at catching the little silly things we do as humans. As such, I have found them to be most useful as PR reviewers, whose output you take with a pinch of salt.
cube00
> It is great at catching the little silly things we do as humans.
It's great some of the time, but the great draw of computing was that it would always catch the silly things we do as humans.
If it didn't, we'd change the code, and from then on (and forever onward) it would catch that case too.
Now we're playing whack-a-mole, pleading with words like "CRITICAL" and bold text in our .cursorrules to get the LLM to pay attention; maybe it works today, it might not work tomorrow.
Meanwhile the C-suite pushing these tools onto us still happily blame the developers when there's a problem.
You should have checked it! What are we paying you for?
saltysalt
Nailed it. The other side of the marketing hype cycle will be saner, when the market forces sort the wheat from the chaff.
apples_oranges
Too much money was invested, it needs to be sold.
Julien_r2
I actually hope to find better answers here than on the Cursor forum, where people seem to be basically saying "it's your fault" instead of answering the actual question, which is about trust, process, and real-world use of agents.
So far it's just reinforcing my feeling that none of this is actually used at scale. We use AI as relatively dumb companions, let it go wilder on side projects which have looser constraints, and agents are pure hype (or for very niche use cases).
sdoering
What specific improvements are you hoping for? Without them (in the original forum post) giving concrete examples, prompts, or methodology – just stating "I write good prompts" – it's hard to evaluate or even help them.
They came in primed against the agentic workflow. That is fine. But they also came in without providing anything that might have given other people the chance to show that their initial assumptions were flawed.
I've been working with agents daily for several months. Still learning what fails and what works reliably.
Key insights from my experience:
- You need a framework (like agent-os or similar) to orchestrate agents effectively
- Balance between guidance and autonomy matters
- Planning is crucial, especially for legacy codebases
Recent example: Hit a wall with a legacy system where I kept maxing out the context window with essential background info. After compaction, the agent would lose critical knowledge and repeat previous mistakes.
Solution that worked:
- Structured the problem properly
- Documented each learning/discovery systematically
- Created specialized sub-agents for specific tasks (keeps context windows manageable)
Only then could the agent actually help navigate that mess of legacy code.
rafaelmn
So at what point are you doing more work on the agent than working on the code directly? And what are you losing in the process of shifting from code author to LLM manager?
My experience is that once I switch to this mode, when something blows up I'm basically stuck with a bunch of code that I only sort of know, even though I reviewed it. I just don't have the same insight as I would if I wrote the code, no ownership, even if it was committed in my name. Any misconceptions I had about how things work, I will still have, because I never had to work through the solution, even if I ended up with the final working one.
tossandthrow
With all that additional work, would you assess that you have been more cost-effective than just doing these tasks yourself with an AI companion?
laborcontract
The reason OP is getting terrible results is that he's using Cursor, and Cursor is designed to ruthlessly prune context to curtail costs.
Unlike the model providers, Cursor has to pay the retail price for LLM usage. They're fighting an ugly marginal price war. If you're paying more for inference than your competitors, you have to choose to either 1) deliver performance equal to the competition at a loss, or 2) economize by feeding smaller contexts to the model providers.
Cursor is not transparent about how it handles context. From my experience, it's clear that they use aggressive strategies to prune conversations, to the extent that it's not uncommon for Cursor to have to reference the same file multiple times in the same conversation just to know what's going on.
My advice to anyone using Cursor is to just stop wasting your time. The code it generates creates so much debt. I've moved on to Codex and Claude and I couldn't be happier.
addandsubtract
What deal is GitHub Copilot getting then? They also offer all SOTA models. Or is the performance of those models also worse there?
laborcontract
Github Copilot is likely running models at or close to cost, given that Azure serves all those models. I haven't used Copilot in several months so I can't speak to its performance. My perception back then was that its underperformance relative to peers was because Microsoft was relatively late to the agentic coding game.
lxgr
I've had agents find several production bugs that slipped past me (as I couldn't dedicate enough time to chase down relatively obscure and isolated bug reports).
Of course there are many more bugs they'll currently not find, but when this strategy costs next to nothing (compared to a SWE spending an hour spelunking) and still works sometimes, the trade-off looks pretty good to me.
zwnow
Exactly, the actual business value is way smaller than people think, and it's honestly frustrating. Yes, they can write boilerplate; yes, they sometimes do better than humans in well-understood areas. But it's negligible considering all the huge issues that come with them: big-tech vendor lock-in, data poisoning, unverifiable information, death of authenticity, death of creativity, ignorance of LLM evangelists, power hunger at a time when humanity should be looking at how to decrease emissions, theft of original human work, theft of data that big tech has gotten away with for way too long. It's puzzling to me how people actually think this is a net benefit to humanity.
alwahi
From a cursory (heh) reading of the Cursor forum, it is clear that the participants are treating AI the way the Adeptus Mechanicus treats the Omnissiah... the machine spirits aren't cooperating with them, though.
hansmayer
The answer is really trivial and embarrassingly simple, once you remove the engineering/functional/world-improvement goggles. The answer is: because the rich folks invested a ton of money and they need it to work. Or at least to make most white-collar work dependent on it, quality be damned. Hence the ever-increasing pushing, nudging, advertising, and offering to use the crap-tech everywhere. It seems it will not win over the engineers. Unfortunately it seems to work with most of the general population. Every lazy recruiter out there is now using ChatGPT to generate job summaries and "evaluate" candidates. Every generic "office worker" deadweight you meet at every company is happy to use it to produce more PowerPoints, slides and documents for you to drown in. And I won't even mention the "content" business model of the influencers.
thaumasiotes
This idea is getting a lot of attention right now.
e.g. https://www.noahpinion.blog/p/americas-future-could-hinge-on...
rokkamokka
I find it funny that the page subheader is "If the economy's single pillar goes down, Trump's presidency will be seen as a disaster".
Is it not a disaster already? The fast slide towards autocracy should certainly be viewed as a disaster if nothing else.
thor-rodrigues
I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
The simplest explanation would be “You’re using it wrong…”, but I have the impression that this is not the primary reason. (Although, as an AI systems developer myself, you would be surprised by the number of users who simply write “fix this” or “generate the report” and then expect an LLM to correctly produce the complex thing they have in mind.)
It is true that there is an “upper management” hype of trying to push AI into everything as a magic solution for all problems. There is certainly an economic incentive from a business valuation or stock price perspective to do so, and I would say that the general, non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor.
While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.
WA
Another theory: you have some spec in your mind, write down most of it and expect the LLM to implement it according to the spec. The result will be objectively a deviation from the spec.
Some developers will either retrospectively change the spec in their head or are basically fine with the slight deviation. Other developers will be disappointed, because the LLM didn't deliver on the spec they clearly hold in their head.
It's a bit like a psychological false-memory effect where you misremember, and/or some people are more flexible in their expectations and accept "close enough" while others won't.
At least, I noticed both behaviors in myself.
benashford
This is true. But, it's also true of assigning tasks to junior developers. You'll get back something which is a bit like what you asked for, but not done exactly how you would have done it.
Both situations need an iterative process to fix and polish before the task is done.
The notable thing for me was, we crossed a line about six months ago where I'd need to spend less time polishing the LLM output than I used to have to spend working with junior developers. (Disclaimer: at my current place-of-work we don't have any junior developers, so I'm not comparing like-with-like on the same task, so may have some false memories there too.)
But I think this is why some developers have good experiences with LLM-based tools. They're not asking "can this replace me?" they're asking "can this replace those other people?"
sussmannbaka
What I want to see at this point are more screencasts, write-ups, anything really, that depict the entire process of how someone expertly wrangles these products to produce non-trivial features. There's AI influencers who make very impressive (and entertaining!) content about building uhhh more AI tooling, hello worlds and CRUD. There's experienced devs presenting code bases supposedly almost entirely generated by AI, who when pressed will admit they basically throw away all code the AI generates and are merely inspired by it. Single-shot prompt to full app (what's being advertised) rapidly turns to "well, it's useful to get motivated when starting from a blank slate" (ok, so is my oblique strategies deck but that one doesn't cost 200 quid a month).
This is just what I observe on HN, I don't doubt there's actual devs (rather than the larping evangelist AI maxis) out there who actually get use out of these things but they are pretty much invisible. If you are enthusiastic about your AI use, please share how the sausage gets made!
KronisLV
> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
Some possible reasons:
* different models used by different folks, free vs paid ones, various reasoning effort, quantizations under the hood and other parameters (e.g. samplers and temperature)
* different tools used, like in my case I've found Continue.dev to be surprisingly bad, Cline to be pretty decent but also RooCode to be really good; also had good experiences with JetBrains Junie, GitHub Copilot is *okay*, but yeah, lots of different options and settings out there
* different system prompts, various tool use cases (e.g. let the model run the code tests and fix them itself), as well as everything ranging from simple and straightforward codebases that are dime a dozen out there (and in the training data), vs something genuinely new that would trip up both your average junior dev, as well as the LLMs
* open ended vs well specified tasks, feeding in the proper context, starting new conversations/tasks when things go badly, offering examples so the model has more to go off of (it can predict something closer to what you actually want), most of my prompts at this point are usually multiple sentences, up to a dozen, alongside code/data examples, alongside prompting the model to ask me questions about what I want before doing the actual implementation
* also sometimes individual models produce output for specific use cases badly, I generally rotate between Sonnet 4.5, Gemini Pro 2.5, GPT-5 and also use Qwen 3 Coder 480B running on Cerebras for the tasks I need done quickly and that are more simple
With all of that, my success rate is pretty great, and the claim that the tech can "barely follow a simple instruction" doesn't hold. Then again, most of my projects are webdev-adjacent in mostly mainstream stacks, YMMV.
surgical_fire
> Then again, most of my projects are webdev adjacent in mostly mainstream stacks
This is probably the most significant part of your answer. You are asking it to do things for which there are a ton of examples of in the training data. You described narrowing the scope of your requests too, which tends to be better.
lm28469
> The simplest explanation would be...
The simplest explanation is that most of us are code monkeys reinventing the same CRUD wheel over and over again, gluing things together until they kind of work and calling it a day.
"developers" is such a broad term that it basically is meaningless in this discussion
impossiblefork
It's true though, they can't. It really depends on what they have to work with.
In the fixed world of mathematics, everything could in principle be great. In software, it can in principle be okay, even though contexts might be longer. But when dealing with contexts that are like real life yet different, such as a story where nobody can communicate with the main characters because they speak a different language, the models simply can't deal with it, always returning to the contexts they're familiar with.
When you give them contexts that are different enough from the kind of texts they've seen, they do indeed fail to follow basic instructions, even though they can follow seemingly much more difficult instructions in other contexts.
Ianjit
"expect an LLM to correctly produce the complex thing they have in mind"
My guess is that for some types of work people don't know what the complex thing they have in mind is ex ante. The idea forms and is clarified through the process of doing the work. For those types of task there is no efficiency gain in using AI to do the work.
ehnto
Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.
But the other thing is that your expectations normalise, and you will hit its limits more often if you are relying on it more. You will inevitably be unimpressed by it the longer you use it.
If I use it here and there, I am usually impressed. If I try to use it for my whole day, I am thoroughly unimpressed by the end, having had to re-do countless things it "should" have been capable of based on my own past experience with it.
lnrd
> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
My hypothesis is that developers work on different things, and while these models might work very well for some domains (React components?), they will fail quickly in others (embedded?). So on one side we have developers working on X (LLM good at it) claiming that it will revolutionize development forever, and on the other side we have developers working on Y (LLM bad at it) claiming that it's just a fad.
h33t-l4x0r
I think this is right on, and the things that LLMs excel at (React components was your example) are really the things there's just such a ridiculous amount of training data for. This is why LLMs are not likely to get much better at code. They're still useful, don't get me wrong, but the 5x expectations need to get reined in.
JCM9
Because the hype cycle on the original AI wave was fading so folks needed something new to buzz about to keep the hype momentum going. Seriously, that’s the reason.
berkes
Do you have any reasoning or anything else to back this up? Edit: honest question, not a diss or a dismissal.
It's an interesting take, one that I believe could be true, but it sounds more like an opinion than a thesis or even fact.
JCM9
Folks aren’t seeing measurable returns on AI. Lots written about this. When the bean counters show up, the easiest way to get out of jail is to say “Oh X? Yeah that was last year, don’t worry about it… we’re now focused on Y which is where the impact will come from.”
Every hype cycle goes through some variation of this evolution. As much as folks try to say AI is different it’s following the same very predictable hype cycle curve.
mnky9800n
As a scientist, there is a ton of boilerplate code that is just slightly different enough for every data set that I need to write it myself each time. So coding agents solve a lot of that. At least until you are halfway through something and you realize Claude didn't listen when you wrote 5 times in capital letters NEVER MAKE UP DATA, YOU ARE NOT ALLOWED TO USE np.random IN PLACE OF ACTUAL DATA. It's all kind of wild because when it works it's great, and when it doesn't there's no failure state. So if I put on my LLM marketing hat, I guess the solution is to have an agent that comes behind the coding agent and checks to see if it does its job. We can call it the Performance Improvement Plan Agent (PIPA). PIPAs allow real-time monitoring of coding agents to make sure they are working and not slacking off, allowing HR departments and management teams to have full control over their AI employees. Together we will move into the future.
vigouroustester
While I agree with the sentiment of not just letting it run free on the whole codebase and do what it wants, I still have good experience with letting it do small tasks one at a time, guided by me. Coding ability of models has really improved over the last few months itself and I seem to be clearing less and less AI-generated code mess than I was 5 months ago.
It's got a lot to do with problem framing and prompt imo.
tejtm
Nigh on thirty years ago, when dabbling in AI, I read a quote I will paraphrase as:
"when you hear 'intelligent agent'; think 'trainable ant'"
jstummbillig
Say more?
drittich
Looks like it comes from Scientific American: https://spaf.cerias.purdue.edu/~spaf/Yucks/V5/msg00004.html
jeswin
There are widely divergent views here. It'd be hard to have a good discussion unless people mention what tasks they're attempting and failing at. And we'll also have to ask if those tasks (or categories) are representative of mainstream developer effort.
Without mentioning what the LLMs are failing or succeeding at, it's all noise.
kykat
The replies are all a variation of: "You're using it wrong"
motorest
> The replies are all a variation of: "You're using it wrong"
I don't know what you are trying to say with your post. I mean, if two people feed their prompts to an agent and one is able to reach their goals while the other fails to achieve anything, would it be outlandish to suggest that one of them is using it right whereas the other is using it wrong? Or do you expect the output to not reflect the input at all?
Alex_L_Wood
And yours is also "you are using it wrong" in spirit.
Are they doing the same thing? Are they trying to achieve the same goals, but fail because one is lacking some skill?
One person may be someone who needs a very basic thing like creating a script to batch-rename his files, another one may be trying to do a massive refactoring.
And while the former succeeds, the latter fails. Is it only because someone doesn't know how to use agentic AI, or because agentic AI is simply lacking?
berkes
And some more variations that, in my anecdotal experience make or break the agentic experience:
* strictness of the result - a personal blog entry vs a complex migration to reform a production database of a large, critical system
* team constraints - style guides, peer review, linting, test requirements, TDD, etc
* language, frameworks - a quick Node.js app vs a Java monolith, e.g.
* legacy - a 12+ year Django app vs a greenfield Rust microservice
* context - complex, historical, nonsensical business constraints and flows vs a simple CRUD action
* body of examples - a simple CRUD TODO in PHP or JS, done a million times, vs an event-sourced, hexagonally architected, cryptographic signing system for govt data.
ares623
I expect the $500 billion magic machine to be magic. Especially after all the explicit threats to me and my friends livelihoods.
kykat
Of course the output reflects the input; that's why it's a bad idea to let the LLM run in a loop without constraints. It's simple maths: if each step is 99% accurate, after 5 steps you're at about 95% accuracy, after 10 steps about 90%, and after 100 steps about 36%.
For LLMs to be effective, you (or something else) need to constantly find the errors and fix them.
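A quick sanity check of that arithmetic (compounding the per-step accuracy):

    # Probability that every step is correct when each step is 99% accurate.
    for n in (5, 10, 100):
        print(n, round(0.99 ** n, 3))   # 5 -> 0.951, 10 -> 0.904, 100 -> 0.366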
magicalhippo
I've had good experience getting a different LLM to perform a technical review, then feeding that back to the primary LLM, but telling it to evaluate the feedback rather than just blindly accepting it.
You still have to have a hand on the wheel, but it helps a fair bit.
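Roughly, the loop looks like this; call_primary and call_reviewer are placeholders for whatever model clients you happen to use, the point is only the shape of the feedback loop:

    from typing import Callable

    def cross_review(task: str,
                     call_primary: Callable[[str], str],
                     call_reviewer: Callable[[str], str],
                     rounds: int = 2) -> str:
        # Primary model produces a first draft.
        draft = call_primary(task)
        for _ in range(rounds):
            # A second model reviews the draft.
            review = call_reviewer(
                "Review this change for bugs, missed edge cases and questionable "
                "design. Be specific.\n\n" + draft
            )
            # The primary model is asked to evaluate the feedback rather than
            # blindly apply it.
            draft = call_primary(
                f"Task: {task}\n\nYour previous attempt:\n{draft}\n\n"
                f"A reviewer commented:\n{review}\n\n"
                "Assess which points are valid, then produce a revised version."
            )
        return draft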
ninetyninenine
I’ve seen LLMs catch and fix their own mistakes and literally tell me they were wrong and that they are fixing their self-made mistake. This analogy is therefore not accurate, as the error rate can actually decrease over time.
leptons
In my experience it depends on which way the wind is blowing, random chance, and a lot of luck.
For example, I was working on the same kind of change across a few dozen files. The prompt input didn't change, the work didn't change, but the "AI" got it wrong as often as it got it right. So was I "using it wrong" or was the "AI" doing it wrong half the time? I tried several "AI" offerings and they all had similar results. Ultimately, the "AI" wasted as much time as it saved me.
timschmidt
I've certainly gotten a lot of value from adapting my development practices to play to LLM's strengths and investing my effort where they have weaknesses.
"You're using it wrong" and "It could work better than it does now" can be true at the same time, sometimes for the same reason.
ZeWaka
I find it quite funny that one of the users actually posted a fully AI-generated reply (dramatically different grammar and structure than their other posts).
thundoe
Which is true. Launching a Ferrari at 200 mph without steering doesn't take anyone anywhere; it's just a very painful waste of money.
gwd
Exactly one of two things is true:
1. The tool is capable of doing more than OP has been able to make it do
2. The tool is not capable of doing more than OP has been able to make it do.
If #1 is true, then... he must be using it wrong. OP specifically said:
> Please pour in your responses please. I really want to see how many people believe in agentic and are using it successfully
So, he's specifically asking people to tell him how to use it "right".
fabian2k
For me, a big issue is that the performance of the AI tools varies enormously for different tasks. And it's not that predictable when it will fail, which does lead to quite a bit of wasted time. And while having more experience prompting a particular tool is likely to help here, it's still frustrating.
There is a bit of overlap between the stuff you use agents for and the stuff that AI is good at. Like generating a bunch of boilerplate for a new thing from scratch. That makes agent mode a more convenient way for me to interact with AI for the stuff it's useful for in my case. But my experience with these tools is still quite limited.
ehnto
When it works well you both normalise your expectations, and expand your usage, meaning you will hit its limits, and be even more disappointed when it fails at something you've seen it do well before.
stanac
The problem in this case is that LLMs are bad with Go. I don't write Go; I am guessing from my experience with Kotlin. I mainly use Kotlin (REST APIs) and LLMs are often bad at writing it. They confuse, e.g., mockk and mockito functions, and then the agent spirals into a never-ending loop of guessing what's wrong and trying to fix it in 5 different ways. Instead I use only chat, validate every output, and point out errors they introduce.
On the other hand colleagues working with react and next have better experience with agents.
varun_chopra
Marketing is being done really well in 2025, with brands injecting themselves into conversations on Reddit, LinkedIn, and every other public forum. [1]
CEOs, AI "thought leaders," and VCs are advertising LLMs as magic, and tools like v0 and Lovable as the next big thing. Every response from leaders is some variation of https://www.youtube.com/watch?v=w61d-NBqafM
On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing. It’s up to the LLM to follow instructions, and it does so based on RNG as far as I can tell. I have very simple, basic rules set up that are never followed. This leads me to believe everyone posting on that thread on Cursor is an amateur.
Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.
I’m at a stage where I use LLMs the same way I would use speech-to-text (code) - telling the LLM exactly what I want, what files it should consider, and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.
Edit:
[1] To add to this, any time you use search or Perplexity or what have you, the results come from all this marketing garbage being pumped into the internet by marketing teams.