Thoughts on a month with Devin

193 comments

·January 17, 2025

rbren

I'm one of the creators of OpenHands (fka OpenDevin). I agree with most of what's been said here, wrt to software agents in general.

We are not even close to the point where AI can "replace" a software engineer. Their code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp. I've talked to companies who went all in on AI engineers, only to realize two months later that their codebase was rotting because no one was reviewing the changes.

But once you develop some intuition for how to use them, software agents can be a _massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself. I especially love asking it to do simple, tedious things like fixing merge conflicts or failing linters. It's great at getting an existing PR over the line.

It's also important to keep in mind that these agents are literally improving on a _weekly_ basis. A few weeks ago we were at the top of the SWE-bench leaderboard; now there are half a dozen agents that have pulled ahead of us. And we're one launch away from leapfrogging back to the top. Exciting times!

https://github.com/All-Hands-AI/OpenHands

jebarker

> code still needs to be reviewed and tested, at least as much as you'd scrutinize the code of a brand new engineer just out of boot camp

> ..._massive_ boost to productivity. ~20% of the commits to the OpenHands codebase are now authored or co-authored by OpenHands itself.

I'm having trouble reconciling these statements. Where does the productivity boost come from since that reviewing burden seems much greater than you'd have if you knew commits were coming from a competent human?

lars512

There's often a lot of small fixes that not time efficient to do, but a solution is not much code and is quick to verify.

If the cost is small to setting a coding agent (e.g. aider) on a task, seeing if it reaches a quick solution, and just aborting if it spins out, you can solve a subset of these types of issues very quickly, instead of leaving them in issue tracking to grow stale. That lets you up the polish on your work.

That's still quite a different story to having it do the core, most important part of your work. That feels a little further away. One of the challenges is the scout rule, the refactoring alongside change that makes the codebase nicer. I feel like today it's easier to get a correct change that slightly degrades codebase quality, than one that maintains it.

jebarker

Thanks - this all makes sense - I still don't feel like this would constitute a massive productivity boost in most cases, since it's not fixing time consuming major issues. But I can see how it's nice to have.

lolinder

I haven't started doing this with agents, but with autocomplete models I know exactly what OP is talking about: you stop trying to use models for things that models are bad at. A lot of people complain that Copilot is more harm than good, but after a couple of months of using it I figured out when to bother and when not to bother and it's been a huge help since then.

I imagine the same thing applies to agents. You can waste a lot of time by giving them tasks that are beyond them and then having to review complicated work that is more likely to be wrong than right. But once you develop an intuition for what they can and cannot do you can act appropriately.

drewbug01

I suspect that many engineers do not expend significant energy on reviewing code; especially if the change is lengthy.

linsomniac

>burden seems much greater than...

Because the burden is much lower than if you were authoring the same commit yourself without any automation?

jebarker

Is that true? I'd like to think my commits are less burdensome to review than a fresh out of boot camp junior dev especially if all that's being done is fixing linter issues. Perhaps there's a small benefit, but doesn't seem like a major productivity boost.

sureglymop

My biggest issue is just how often these agents make subtle, hard to spot mistakes.

It can seem great for certain tasks at first. Yesterday I had to add license headers to the top of a lot of source code files. The reason why I let the AI try is because the repository contained lots of different programming languages.

It was able to do this but I then realized that it just removed the last sentence of the text it was supposed to add.

bufferoverflow

We've seen exponential improvements in LLM's coding abilities. Went from almost useless to somewhat useful in like two years.

Claude 3.5 is not bad really. I wanted to do a side project that has been on my mind for a few years, and Claude coded it in like 30 seconds.

So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

Zanfa

> So to say "we are not even close" seems strange. If LLMs continue to improve, they will be comparable to mid level developers in 2-3 years, senior developers in 4-5 years.

These sorts of things can’t be extrapolated. It could be 6-months, it could be a local maxima / dead end that’ll take another breakthrough in 10 years like transformers were. See self-driving cars.

barrell

I think the most you could say is we’ve had improvements - from gpt 4 to whatever the current model is has definitely not been exponential improvements.

My experience is acctually they’ve become dramatically less helpful over the past two years (past year in particular). Claude seems not to have backslid much but it’s progression also has not been very fast at all (I’ve noticed no difference since the 3.5 launch despite several updates).

Everything grows sinusoidally and I feel we’re well past the tipping point into diminishing rate of improvement

veggieroll

What does the cost look like for running OpenHands yourself? From your docs, it looks like you recommend Sonnet @ $3 / million tokens. But I could imagine this can add up quickly if you are sending large portions of the repository at a time as context.

xmprt

I think one of the big problems with Devin (and AI agents in general) is that they're only ever as good as they are. Sometimes their intelligence feels magical and they accomplish things within minutes that even mid level or senior software engineers would take a few hours to do. Other times, they make simple mistakes and no matter how much help you give, they run around in circles.

A big quality that I value in junior engineers is coachability. If an AI agent can't be coached (and it doesn't look like it right now), then there's no way I'll ever enjoy using one.

ipnon

My first job I spent so much time reading Python docs, and the ancient art of Stack Overflow spelunking. But I could intuitively explain a solution in seconds because of my CS background. I used to encounter a certain kind of programmer often, who did not understand algorithms well but had many years of experience with a language like Ruby, and thus was faster in completing tasks because they didn't need to do the reference work that I had to do. Now I think these kinds of programmers will slowly disappear and only the ones with the fast CS intuition will remain.

joduplessis

I've found the opposite true as well.

halfmatthalfcat

I disagree. If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).

If anything, my gut says that the CS concepts are very easy for LLMs to recall and will be the first things replaced (if ever) by AI. Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.

There's also the meme in the industry that self-taught, non-CS degree engineers are potentially of the most capable group. Though this is anecdotal.

ben_w

> If anything, CS degrees have proven time and time again they aren't translatable into software development (which is why there's an entire degree field called Software Engineering emerging).

Emerging? I graduated in 2006 with a BEng in Software Engineering.

The difference between it and the BSc CompSci degree I started in, was that optional modules became mandatory — including an industrial placement year (paid internship).

> Software engineer{ing,s} (project construction, integrations, scaling, organizational/external factors, etc) will stick around for along time.

My gut disagrees, because LLMs are at about the same level in those things as they are in low level coding: not yet replacing humans in project level tasks any more than they do in coding tasks, but also being OK assistants for both coding and project domains. I have no reason to think either has less or more opportunity for self-training, so I expect progress to track for the foreseeable future.

(That said, the foreseeable future in this case is 1-2 years).

viraptor

> the CS concepts are very easy for LLMs to recall

They're easy to recall, but you have to know what to recall in the first place. Or even know enough of the territory to realise there's something to recall. Without enough background, you'll get a whole set of amazing tools that you have no idea what to do with.

For example, you may be able to write a long description of your problem with some ideas how to steer the AI to give you possible solutions. And the AI may figure out what the problem is and that the hyperloglog is something that could be useful to you. And you may have the awesome programming skills to implement that. But that's a lot of maybes. It would be much faster/easier if you knew about hyperloglog ahead of time and just asked for the implementation or library recommendation.

Or even if you don't know about the actual solution, you'd have enough of CS vocabulary to ask: "how do I get a fast, approximate distinct count from a multiset". It would take a long imprecise description to get the same thing for a coder with no theory background.

cmiles74

I'm not convinced an LLM is really "recalling" any CS concepts when they try to solve a problem. IMHO, we're lucky if it matches the pattern of the request against the pattern of a solution and the two are actually related. I'm no expert but I don't think there's any reason to think that an LLM is taking a CS concept and applying it to something novel in order to get a solution. If they were, I believe their success rate would be much higher.

In many places where someone might reach for something they remember from their CS coursework, there's often an open-source library or tool doing much the same thing. Understanding how these libraries and tools function is certainly valuable but, much of the time, people can get by with only a vague hunch; indeed, this is why they exist! IMHO, I would be happier with the LLM assistant if it picked reliable library code rather than writing up a sketchy solution of its own.

I'm also familiar with this idea people who have managed to be successful in the field without a CS degree are more capable. In my opinion, this is hogwash. I think if we take a step back, we'll see that people graduating from established, top-tier CS programs are looking for higher pay than those who have come from a less expensive and (very often) business focused program. To be fair, people from each of these backgrounds has their strengths; in many organizations a developer who has done two semesters of accounting is a real benefit, in others the ability to turn a research paper into the prototype of a new product is going to be important.

Years of experience often washes out much of these differences. People who have started from business oriented education programs may end up taking a handful of CS courses as they navigate their career, likewise many people with a CS background end up accruing business centered skills.

In my opinion, people start out their education at a place that they can afford, a place that is familiar to them, often a place that they feel comfortable. Someone's economic background (and of their family) plays a big role on what kind of educational program they choose when they are on the cusp of adulthood. Smart and talented people can always learn what they need to learn, regardless of their starting point.

jpc0

I think honestly the meme that non-CS degree engineers are most capable is selection bias.

If they had taken a CS degree they would likely be just as, of not more capable.

To self-learn the topics you need to make good software takes an immense amount of effort and although the data and material is out there, takes a lot of work to figure out.

I'm only recently starting to pick up on "magic" patterns that are actually extremely simple to understand given the right base knowledge... I can gain tons of insights from talks givem in the early 2010s but if I watched them without the correct practical experience and foundational knowledge it is the same as the title to a HN post this week[1], gibberish.

With the correct time playing with the foundational patterns and learning some of the backing knowledge it unlocks amazing patterns in my mind and makes the magic seem simple. A great example, CSP[2]. I've known about and used the actor model before, which I first discovered when I found Erlang, but now with CSP I could ask the question "Why should actors be heavy?", you can put an actor into a light-weight task and spawn tons of them and build a tree of connections. Stuff like oneTBB flow graph[3]now makes sense and looks like a beautiful pattern with some really interesting ideas that can be implemented in more general computing than the high performance computing it was designed for. It seems niche but golang is built on those foundations, and the true power of concurrency in golang comes from embracing that. It fundamentally changes the way I want to structure and layout code and I feel like a good CS course can get you there quicker...

Unfortunately a good CS course probably wouldn't accelerate the average CS grads understanding of that but can get someone dedicated and hungry there much much quicker. Someone fresh out of a JS bootcamp is maybe a decade away from that if they ever even want to search for that knowledge.

1. https://news.ycombinator.com/item?id=42711751

2. https://en.m.wikipedia.org/wiki/Communicating_sequential_pro...

3. https://oneapi-spec.uxlfoundation.org/specifications/oneapi/...

marcyb5st

I completely agree with you. More precisely, I feel they are useful when you have specific tasks with limited scope.

For instance, just yesterday I was battling with a complex SQL query and I got halfway there. I gave our bot the query and an half assed description of what I wanted/what was missing and it got it right on the first try.

datadrivenangel

Are you sure that your SQL query is correct?

xboxnolifes

Can they be sure even if they wrote it theirself?

QuadmasterXLII

he’s certainly sure, but lord knows if it is

kkaatii

And when working with people it's fairly easy to intervene and improve when needed. I think the current working model with LLMs is definitely suboptimal when we cannot confine their solution space AND where they should apply a solution precisely, and timely.

llamaimperative

It’s also often possible to know what a human will be bad at before they start. This allows you to delegate tasks better or vary the level of pre-work you do before getting started. This is pretty unpredictable with LLMs still.

null

[deleted]

null

[deleted]

CGamesPlay

As someone who uses AI coding tools daily and has done a fair amount of experimentation with different approaches (though not Devin), I feel like this tracks pretty well. The problem is that Devin and other "agentic" approaches take on more than they can handle. The best AI coders are positioned as tools for developers, rather than replacements for them.

Github Copilot is "a better tab complete". Sure, it's a neat demo that it can produce a fast inverse square root, but the real utility is that it completes repetitive code. It's like having a dynamic snippet library always available that I never have to configure.

Aider is the next step up the abstraction ladder. It can edit in more locations than just the current cursor position, so it can perform some more high-level edit operations. And although it also uses a smarter model than Copilot, it still isn't very "smart" at the end of the day, and will hallucinate functions and make pointless changes when you give it a problem to solve.

frereubu

When I tried Copilot the "better tab complete" felt quite annoying, in that the constantly changing suggested completion kept dragging my focus away from what I was writing. That clearly doesn't happen for you. Was that something you got used to over time, or did that just not happen for you? There were elements of it I found useful, but I just couldn't get over the flickering of my attention from what I was doing to the suggested completions.

Edit: I also really want something that takes the existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions. Does Copilot do that now?

macNchz

I tried to get used to the tab completion tools a few times but always found it distracting like you describe. often I’d have a complete thought, start writing the code, get a suggested completion, start reading it, realize it was wrong, but then I’d have lost my initial thought, or at least have to pause and bring myself back to it.

I have, however, fully adopted chat-to-patch style workflows like Aider, I find it much less intrusive and distracting than the tab completions, since I can give it my entire thought rather than some code to try to complete.

I do think there’s promise in more autonomous tools, but they still very much fall into the compounding-error traps that agents often do at the present.

CGamesPlay

I have the automatic suggestions turned off. I use a keybind to activate it when I want it.

> existing codebase in the form of a VSCode project / GitHub repo and uses that as a basis for suggestions

What are you actually looking for? Copilot uses "all of github" via training, and your current project in the context.

frereubu

> I have the automatic suggestions turned off. I use a keybind to activate it when I want it.

I didn't realise you could do that. Might give it another go.

> Copilot uses "all of github" via training, and your current project in the context.

The current project context is the bit I didn't think it had. Thanks!

wrsh07

For cursor you can chat and ask @codebase and it will do rag (or equivalent) to answer your question

goosejuice

Copilot is also very slow. I'm surprised people use it to be honest. Just use Cursor.

pindab0ter

Cursor requires you to use their specific IDE though, doesn't it? With Copilot I don't have to switch contexts as it lives in my Jetbrains IDE.

mattnewton

I would try cursor. It’s pretty good at copy pasting the relevant parts of the codebase in and out of the chat window. I have the tab autocomplete disabled.

Aeolun

Cursor tab does that. Or at least, it takes other open tabs into account when making suggestions.

sincerely

i’ve been very impressed with the gemini autocomplete suggestions in google colab, and it doesn’t feel more/less distracting than any IDEs built in tab suggestions

verdverm

I think a lot of people who are enabling copilot in vs code (like I did a few days ago), are experiencing "suggested autocomplete as I type" for the first time where before there was no grey text below what I am writing personally.

It is a huge distraction, especially if it changes as I write more. I turned it off almost immediately.

I deeply regret turning on copilot in vscode. It (M$) immediately weaseled into so many places and settings. I'm still trying to scaled it back. Super annoying and distracting. I'd prefer a much more opt in for each feature than what they did.

the_af

> The best AI coders are positioned as tools for developers, rather than replacements for them.

I agree with this. However, we must not delude ourselves and understand that corporate is pushing for replacement. So there will be a big push to improve on tools like Devin. This is not a conspiracy theory, in many companies (my wife's, for example) they are openly stating this: we are going to reduce (aka "lay off") the engineering staff and use as much AI solutions as possible.

I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. Not everyone can be a top of the cream specialist. And it'll be used to drive down salaries, too.

lolinder

I remember when I was first getting started in the industry the big fear of the time was that offshoring was going to take all of our jobs and drive down the salaries of those that remained. In fact the opposite happened: it was in the next 10 years that salaries ballooned and tech had a hiring bubble.

Companies always want to reduce staff and bad companies always try to do so before the solution has really proven itself. That's what we're seeing now. But having deep experience with these tools over many years, I'm very confident that this will backfire on companies in the medium term and create even more work for human developers who will need to come in and clean up what was left behind.

(Incidentally, this also happened with offshoring— many companies ended up with large convoluted code bases that they didn't understand and that almost did what they wanted but were wrong in important ways. These companies needed local engineers to untangle the mess and get things back on track.)

senordevnyc

But having deep experience with these tools over many years, I'm very confident...

No one has had deep experience with these tools for any amount of time, let alone many years. They're literally just now hitting the market and are rapidly expanding their capabilities. We're at a fundamentally different place than we were just twelve months ago, and there's no reason to think 2025 will be any different.

the_af

I think it's qualitatively different this time.

Unlike with offshoring, this is a technological solution, which understandably is received more enthusiastically on HN. I get it. It's interesting as tech! And it's achieved remarkable things. But unlike with offshoring (which is a people thing) or magical NOCODE/CASE/etc "solutions", it seems the consensus is that AI coding assistants will eventually get there. At least a portion of even HN seems to think so. And some are cheering!

The coping mechanism seems to be "it won't happen to me" or "my knowledge is too specialized" but I think this will become increasingly false. And even if your knoweldge is too specialized to be replaced by AI, most engineers aren't like that. "Well, become more specialized" is unrealistic advice, and in any case, the employment pool will shrink.

PS: I am offhsoring (in a way). I'm not based in the US but I work remotely for a US company.

nyarlathotep_

> I wonder how many of us, here, understand that many jobs are going away if/when this works out for the companies. And the usual coping mechanism, "it will only be for low hanging fruit", "it will never happen to me because my $SKILL is not replaceable", will eventually not save you. Sure, if you are a unique expert on a unique field, but many of us don't have that luxury. And it'll be used to drive down salaries, too.

Yeah it's maddening.

The cope is bizarre too: "writing code is the least important part of the job"

Ok then why does nearly every company make people write code for interviews or do take home programming projects?

Why do people list programming languages on their resumes if it's "least important"?

Also bizarre to see people cheering on their replacements as they use all this stuff.

s1mplicissimus

> Ok then why does nearly every company make people write code for interviews or do take home programming projects?

For the same reason they put leetcode problems to "test" an applicants skill. Or have them write mergesort on a chalkboard by hand. It gives them a warm fuzzy feeling in the tummy because now they can say "we did something to check they are competent". Why, you ask? Well it's mostly impossible to come up with a test to verify a competency you don't have yourself. Imagine you can't distinguish red and green, are not aware of it, but want to hire people who can. That's their situation, but they cannot admit it - because it would be clear evidence that they are no good fit for their current role. Use this information responsibly ;)

> Why do people list programming languages on their resumes if it's "least important"?

You put the programming languages in there alongside the HR-soothing stuff because you hope that an actual software person gets to see your resume and gives you an extra vote for being a good match. Notice that most guides recommend a relatively small amount of technical content vs. lots of "using my awesomeness i managed to blafoo the dingleberries in a more efficient manner to earn the company a higher bottom line"

If you don't want to be a software developer that's fine. But your questions point me towards the conclusion that you don't know a lot of things about software development in the first place which doesn't speak for your ability to estimate how easy it will be to automate it using LLMs.

dimitri-vs

Is spending 4 years of your life on education that will likely only be 10-20% applicable to your job any less bizarre? It's just another hoop employers want to see you capable of jumping.

If you ignore the syntax programming is just writing detailed instructions. Just because AI is able to translate English to code doesn't mean the 100s of decisions that need to be made go away. Someone still needs to write very detailed instructions even if they are in English and it sure isn't going to be the people sitting in meetings all day.

And let's pretend that I can now be 10x more productive with AI. Great, now I can ship 10x more features in the same timeframe and nothing changes - the development backlog is literally infinite. There are always more features or bugs to work on.

qup

It's weird to talk about aider hallucinating.

That's whatever model you chose to use with it. Aider can use any.l model you like.

davedx

One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to setup API keys in their cloud console - but for "soft stops" it wouldn't.

By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)

coffeebeqn

LLMs can create infinite worlds in the error message it’s receiving. It probably needs some outside signal to stop and re-assess. I don’t think LLMs have any ability to reason if they’re lost in their own world on their own. They’ll just keep creating new less and less coherent context for themselves

someothherguyy

If you correct an LLM based agent coder, you are always right. Often, if you give it advice, it pretends like it understands you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you telling you it did things it didn't do. (In my experience)

rsynnott

So when people say these things are like junior developers, they really mean that they’re like the worst _stereotype_ of junior developers, then?

davedx

For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.

What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.

mousetree

How would it decide if it needs a reality check? Would the thing checking have the same limitations?

tobyhinloopen

You can maybe have a supervisor AI agent trigger a retry / new approach

nejsjsjsbsb

They need impatience!

verdverm

I think training it to do that would be the hard part.

- stopping is probably the easy part

- I assume this happens during RLFH phase

- Does the model simply stop or does it ask a question?

- You need a good response or interaction, depending on the query? So probably sets or decision trees of them, or agentic even? (chicken-egg problem?)

- This happens 10s of thousands of times, having humans do it, especially with coding, is probably not realistic

- Incumbents like M$ with Copilot may have an advantage in crafting a dataset

csomar

> One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?

You are over-estimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe was it astroturfing?) but once I checked it out, it was not far from avante on neovim.

a1j9o94

Cursor isn't designed to do long running tasks. As someone mentioned in another comment it's closer to a function call than a process like Devin.

It will only do one task at a time that it's asked to do.

dimitri-vs

...for now.

They are pushing in this direction with the Composer Agent mode which can carry out a sequence of multi-file changes without you having to specify the files. It's pretty decent. If you're feeling brave there is also a beta "YOLO" mode that will auto approve these changes and run console commands.

ImHereToVote

There should be an energy coefficient to problems. You only get a set amount of energy to solve per issue. When the energy runs out. A human must help.

rfoo

Devin does ask for help when it can't do something. I think I have it asked me how to use a testing suite it had trouble running.

The problem is it really really hate asking for help if it had a skill issue, it would prefer running in circles than admitting it just can't do something.

Aeolun

So they perfectly nailed the junior engineer. It’s just that that isn’t what people are looking for.

rfoo

Maybe. It's pretty weird and I'm still thinking about it.

You can't throw junior engineers working on an issue under the bus when they clearly can't do that. Or at least it takes some effort. In return you may coach them and hope they eventually improves.

Devin does look like junior engineers, but I've learned to just click "Terminate Session" immediately after I spotted that it was doing something hopeless. I've managed to get some real work done out of it, without much effort on my side (just check what it's doing every 10~15 minutes and type a few lines or restart session).

mkagenius

If they had built that from the beginning people would have said "every other tasks it asks me for help, how is it a developer then if I have to assist it all the time?"

But now since you are okay with that, I think it's the right time to add that feature.

bot403

You can set a "max work time" before it pauses so it wont go for days endlessly spending your credits. By default its set to 10 credits.

So I'm not sure how the author got it to go for days.

rco8786

I'm sure a lot of folks in these comments predicted these sorts of results with surprising accuracy.

Stuff like this is why I scoff when I hear about CEOs freezing engineering hiring or saying they just don't need mid-level engineers anymore because they have AI.

I'll start believing that when I see it happening, and see actual engineers saying that AI can replace a human.

I am long AI, but I think the winning formula is small, repetitive tasks with a little too much variation to make it worth it (or possible) to automate procedurally. Pulling data from Notion into Google sheets, like these folks did initially, is probably fine. Having it manage your infrastructure and app deployments, likely not.

noodletheworld

Those “how I feel about Devin after using it” comments at the bottom are damning, when you compare them to the user testimonials of people using cursor.

Seems to me that agents just aren’t the answer people want them to be, just a hype wave obscuring real progress in other areas (eg. MCST) because they’re easy to implement.

…but really, if things are easy to implement, at this point, you have to ask why they haven’t been done yet.

Probably, it seems, because it’s harder to implement in a way that’s useful than it superficially appears…

Ie. If the smart folk working on Devin can only do something of this level, anyone working on agentic systems should be worried, because it’s unlikely you can do better, without better underlying models.

null

[deleted]

freddref

How is Devin different from cursor?

I recently used cursor and it has felt very capable in implementing tasks across files. I get that cursor is an IDE but it's ai functionality feels very agentic.. where do you draw the line?

jondwillis

Cursor Composer (both "normal" and "agent" mode) fit the colloquial definition of agent, for sure.

Xmd5a

I had to look up MCST: it means Model-Centric Software Tools, as opposed to autonomous agents.

Devin is closer to a long-running process that you can interact with as it is processing tasks, whereas Cursor is closer to a function call: once you've made the call, the only think you can do is wait for the result.

noodletheworld

It stands for Monte Carlo search tree.

Ie. Better outputs from models, not external tooling and prompt engineering.

https://github.com/zz1358m/MCTS-AHD-master

Melomomololo

Agents are really new and would solve plenty of annoying things.

When I code with Claude, I have to copy paste files around.

But everything we do in AI is new and outdated a few weeks ago.

Claude is really good but blocks you in 1-3h for a bit due to context length.

That type of issues will be solved.

And local coding models are super fast on a 4090 already. Imagine a small project digits on your desktop were you allow these models also more thinking. But the thinking style models again are super new.

Things probably are not done yet because we humans are the bottleneck right now. Getting enough chips, energy, standards, training time, doing experiments with tech a while tech b starts to emerge from another corner of ai.

5090 just was announced and depending on benchmarks it might be 1.x-3 times faster. if it's faster above 1.5 that would again be huge.

llamaimperative

Have you used Cursor, which GP actually refers to?

ianbutler

Disclosure: Working on a company in the space and have recently been compared to Devin in at least one public talk.

Devin has tried to do too much. There is value in producing a solid code artifact that can be handed off for review to other developers in limited capacities like P2s and minor bugs which pile up in business backlogs.

Focusing on specific elements of the development loop such as fix bugs, add small feature, run tests, produce pull request is enough.

Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

yoavm

Not to take away from your opinion, but I guess time will tell? As models get better, it's possible that wide tools like Devin will work better and swallow tools that do one thing. I think companies much rather have a AI solution that works like what they already know (developers), than one that works in the IDE, another that watches to Github issues, another that reviews PRs, and one that hangs on Slack and makes small fixes.

> Businesses like Factory AI or my own are taking that approach and we're seeing real interest in our products.

Interest isn't what tools like Devin are lacking, (un)fortunately.

To be clear, I do share a lot of scepticism regarding all the businesses working around AI code generation. However, that isn't because I think they'll never be able to figure it out, but because I think they are all likely to figure it out at the end, at the same time, when better models come out. And none of them will have a real advantage over the other.

ianbutler

I've recently had several enterprise level conversations with different companies and what we're being asked for is specifically the simpler approach. I think that is the level of risk they're willing to tolerate and it will still ameliorate a real issue for them.

The key here is my product is no worse positioned to do more things if and when the time comes, but building a solid foundation and trust, and not having the quiet part be (which I heard as early as several months ago) that your product doesn't work means we'll hopefully still have the customer base to roll that out to.

I've talked to Devin's CEO once at Swyx's conference last June, they're very thoughtful and very kind so this must be very rough but between when they showed their demo then and what I'm hearing now the product has not evolved in a way where they are providing value commensurate with their marketing or hype.

I'm a fan of Guillermo Rauch's (Vercel CEO) take on these things. You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.

Devin's investment was fueled by hyperspeculation early on when no one knew what the shape of the game was. In many ways we still don't, but if you burn your reputation before we get there you may not be able to capitalize on it.

To be completely fair to them, taking the long view and the bank account to go with it they may still be entirely fine.

likium

> You earn the right to take on bigger challenges and no one in this space has earned the right yet including us.

Not entirely. We're in interesting times where products with better models can suddenly leapfrog and displace even current upstarts. Cursor won over Copilot from leveraging Claude Sonnet 3.5. They didn't "earn the right".

Improvements with models will help those with the existing infrastructure that can benefit from it. I'm not saying Devin will win when that time comes, but a similar product might find their space quickly.

morgante

You can get a much higher hit rate with more constrained agents, but unfortunately if it's too constrained it just doesn't excite people as much.

Ex. the Grit agent (my company) is designed to handle larger maintenance tasks. It has a much higher success rate, with <5% rejected tasks and 96% merged PRs (including some pretty huge repos).

It's also way less exciting. People want the flashy tool that can solve "everything."

npilk

This feels a bit like AI image generation in 2022. The fact that it works at all is pretty mindblowing, and sometimes it produces something really good, but most of the time there are obvious mistakes, errors, etc. Of course, it only took a couple more years to get photorealistic image outputs.

A lot of commenters here seem very quick to write off Devin / similar ideas permanently. But I'd guess in a few years the progress will be remarkable.

One stubborn problem – when I prompt Midjourney, what I get back is often very high-quality, but somehow different than what I expected. In other words, I wouldn't have been able to describe what I wanted, but once I see the output I know it's not quite right. I suspect tools like this will run into similar issues. Maybe there will be features that can help users 'iterate' quickly.

muglug

> Of course, it only took a couple more years to get photorealistic image outputs.

"Photorealistic" is a pretty subjective judgement, whereas "does this code produce the correct outputs" is an objective judgement. A blurry background character with three arms might not impact one's view of a "photorealistic" image, but a minor utility function returning the wrong thing will break a whole program.

dimitri-vs

If were comparing Devin to image generation, then Devin would be a version of Midjourney where you have no prompting skills, you only get one image and if you want something different you can only use the remix feature to make changes, oh and with each change the image resolution goes up and you get more jpeg artifacts.

tlarkworthy

Also trialed Devin, it's quite impressive when it understands the code formatting and local test setup, producing well formatted and test case passing code, but it seems to always add extraneous changes beyond the task that can break other things. And it can't seem to undo those changes if you ask. So everything requires more cleanup. Devin opened my eyes to the power of agentic workflows with closed loop feedback, and the coolness of a slack interface, but I am gonna recommend cancelling it because it's not actually saving time and it's quite expensive.

huijzer

I’ve used Cursor a lot and the conclusion doesn’t surprise me. I feel like I’m the one *forcing* the system in a certain direction and sometimes an LLM gives a small snippet of useful code. Sometimes it goes in the wrong direction and I have to abort the suggestion and force it into another direction. For me, the main benefit is having a typing assistant which can save me from typing one line here and there. Especially refactorings is where Cursor shines. Things like moving argument order around or adding/removing a parameter at function callsites is great. Saved me a ton of typing and time already. I’m way more comfortable just quickly doing a refactoring when I see one.

kromem

Weird. I have such a different experience with Cursor.

Most changes occur with a quick back and forth about top level choices in chat.

Followed with me grabbing appropriate interfaces and files for context so Sonnet doesn't hallucinate API, and then code that I'll glance over and around half the time suggest one or more further changes.

It's been successful enough I'm currently thinking of how to adjust best practices to make things even smoother for that workflow, like better aggregating package interfaces into a single file for context, as well as some notes around encouraging more verbose commenting in a file I can provide as context as well on each generation.

Human-centric best practices aren't always the best fit, and it's finally good enough to start rethinking those for myself.

cootsnuck

This! I've been using Cursor regularly since late 2023. It's all about building up effective resources to tactfully inject into prompts as needed. I'll even give it sample API responses in addition to API docs. Sometimes I'll have it first distill API docs down into a more tangible implementation guide and then save that as a file in the codebase.

I think I'm just a naturally verbose person by default, and I'm starting to think that has been very helpful in me getting a lot out of my use of LLMs and various LLM tools over the past 2+ years.

I treat them like the improv actors they are and always do the up front work to create (with their assistance) the necessary broader context and grounding necessary for them to do their "improv" as accurately as possible.

I honestly don't use them with the immediate assumption I'll save time (although that happens almost all the time), I use them because they help me tame my thoughts and focus my efforts. And that in and of itself saves me time.

huijzer

Interesting. What project are you working on? For me it's writing a library in Rust.

kgilpin

This is what’s needed to get the most out of these tools. You understand deeply how the tool works and so you’re able to optimize its inputs in order to get good results.

This puts you in the top echelon of developers using AI assisted coding. Most developers don’t have this deep of an understanding and so they don’t get results as good as yours.

So there’s a big question here for AI tool vendors. Is AI assisted coding a power tool for experts, or is it a tool for the “Everyman” developer that’s easy to use?

Usage data shows that the most adopted AI coding tool is still ChatGPT, followed by Copilot (even if you’d think it’s Cursor from reading HN :-))

epolanski

I'll add few things at which Cursor with Claude is better than us (at least in time/effort):

- explaining code. Enter some legacy part of your code nobody understands, LLMs aren't limited to keeping few things in memory like us. Even if the code is very obfuscated and poorly written it can understand what it does and the purpose and suggest refactors to make it understandable

- explaining and fixing bugs. Just the other day Antirez posted a bug of him debugging a Redis segfault on some C code providing context and stack trace. This might be a hit or miss at times, but more often than not it saves you hours

- writing tests. It often comes up with many more examples and edge cases than I thought of. If it doesn't, you can always ask it to.

In any case I want to stress that LLMs are only as good as your data and prompts. They lack the nuance of understanding lots of context, yet I see people talking to them like humans that understand the business, best practices and others.

moffkalast

That first one has always felt super crazy to me, I've figured out what lots of "arcane magic, don't touch" type of functions genuinely do since LLMs have become a thing.

Even if it's slightly wrong it's usually at least in the right ballpark so it gives you a very good starting point to work from. Almost everything is explainable now.

epolanski

I can relate, I have been genuinely amazed more than once by how it could "understand" some very complex code nobody dared to touch like you mention.

sherburt3

Agreed, AI has been a godsend for trying to understand snippets of perl code in our codebase that were basically unreadable before unless you were an expert.

timrichard

I think the .cursorrules and .cursorignore files might be useful here.

Especially the .cursorrules file, as you can include a brief overview of the project and ground rules for suggestions, which are applied to your chat sessions and Cmd/Ctrl K inline edits.

qouteall

For moving argument order and removing parameter is already doable by a mature IDE and they are more predictable than AI (Jetbrains IDEs support them well. In VSCode it may need extensions.)

But adding parameter is not well supported by IDE as it requires knowing which value to pass. This is where Cursor can shine.

falcor84

So for anyone who doubted SWE-BENCH's relevance's to typical tasks, it seems that its stated 13.86% almost exactly matches this 3 successes out of 20 pilot outcome.

We're not quite there yet, but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction. I would now expect pretty much the textbook disruptive innovation process over the next decade or so, until the typical human dev role is pushed to something more akin to the responsibilities of current day architects and product managers. QA engineering though will likely see a big resurgence.

toyetic

>> but all of these issues seem to be solvable with current tech by applying additional training and/or "agentic" direction.

Can you explain why you think this. From what I gather from other comments it seems like if we continue on current trajectory at best you'd still need a dev who understands the projects context to work in tandem w/ the agent so the code doesn't devolve into slop.

falcor84

> so the code doesn't devolve into slop

As I see it, this is pretty much a given across all codebases, with a natural tendency of all projects to become balls of mud if the developer(s) don't actively strive to address technical debt and continuously refactor to address the new needs. But having said that, my experience is that for a given task in an unfamiliar codebase, an AI agent is already better at maintaining consistency than a typical junior developer, or even a mid-level developer who recently joined the team. And when explicitly given the task of refactoring the codebase while keeping the tests passing, the AI agents are already very capable.

The biggest issue, which is what you may be alluding to, is that AI agents are currently very bad at recognizing the limits of their capabilities and continue trying an approach when a human dev would have long since given up and went to their lead to ask for help or for the task specification to be redefined. That's definitely an issue, but I don't see any fundamental technological limitation here, but rather something addressable via an engineering effort.

In general, I've seen so many benchmarks fall to AI in the recent decade (including SWE-BENCH), that now I'm quite confident that if a task being performed by humans can be defined with clear numerical goals, then it's achievable by AI.

And another way I'm looking at it is that for any specific knowledge work competency, it seems to already be much easier and time effective to train an AI to do well on it than to create a curriculum for humans to learn it and then to have every single human to go through it.

DaveMcMartin

This only reinforces my bias against AI agents. At this point, they are mostly just hype. I believe that for AI to replace a junior, we would need to achieve at least near-AGI, and we are far from that.

eitland

If by hype you mean that there isn't extreme real world value right here and right now, then I very much disagree.

Closing in on 20 years since I left school and for me AI is absolutely useful, right here and right now. It is really a bicycle for the mind:

It allows me to get much faster to where I want. (And like bicycles you will get a few crashes early on and possibly later as well, depending on how fast you move and how careful you are.)

I might be in some sweet spot where I am both old enough to know what is going on without using an AI but also young enough to pick up the use of AI relatively effortlessly.

If however by hype you mean that people still have overhyped expactations about the near future, then yes, I agree more and more.

marginalia_nu

I feel AI can also do simple monotonous coding tasks, but I don't think programming is something it's currently very good at. Samples, yes, trivial programs, sure, but anything non-trivial and it's rarely useful.

Where it really shines today is getting humans up to speed with new technologies, things that are well understood in general but maybe not well understood by you.

Want to say build a window manager in X11, despite never having worked with X11 before? Sure, Claude will point you in the right direction and give you a simple template to work with in 30 seconds. Enormous time saver compared to figuring out how to do that from scratch.

Never touched node in your life but want to build a simple electron app? Sure, here's how you get started. Few hours and several follow up questions later, you're comfortable and productive in the environment.

Getting off the ground with new technologies is so much easier with AI it's kind of ridiculous. The revolutionary part of AI coding is how it makes it much easier to be a true generalist, capable of working in any environment with any technology, whatever is appropriate.

thegeomaster

Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or introspect why they failed. They need actual guidance from somebody with much common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.

falcor84

A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about Deepmind, once they fully enter this race.

thegeomaster

> And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models),

Agree with this totally.

I wouldn't call what the CoT models are doing exactly being able to step back - their "stepping back" still dumps tokens into the output, so it is still burdened with seeing all of these failed attempts as it searches for the right one. But my intuition on this can be wrong, and it's a much more advanced reasoning process than what "last-gen" (non-CoT) does, so I can see your point.

For an agentic system composed of multiple LLMs, I would strongly disagree if the LLMs are last-gen. In my experience, it is very hard to prompt a non-CoT LLM into rejecting an upstream assumption without making it paranoid and rejecting valid assumptions as well. This makes it hard to effectively create a robust agentic system that can self-correct.

I think that's different if the agents are o1-level, but I think it's hard to appreciate just how costly and slow doing this would be. Agents consume tokens like candy with all the back-and-forth, so a surprising number of tasks become economically infeasible.

(It seems everyone is waiting for an inference perf breakthrough that may or may not come.)