
AI agents: Less capability, more reliability, please

simonw

Yeah, the "book a flight" agent thing is a running joke now - it was a punchline in the Swyx keynote for the recent AI Engineer event in NYC: https://www.latent.space/p/agent

I think this piece is underestimating the difficulty involved here though. If only it were as easy as "just pick a single task and make the agent really good at that"!

The problem is that if your UI involves human beings typing or talking to you in a human language, there is an unbounded set of ways things could go wrong. You can't test against every possible variant of what they might say. Humans are bad at clearly expressing things, but even worse is the challenge of ensuring they have a concrete, accurate mental model of what the software can and cannot do.

photonthug

> The problem is that if your UI involves human beings typing or talking to you in a human language, there is an unbounded set of ways things could go wrong. You can't test against every possible variant of what they might say.

It's almost like we really might benefit from using the advances in AI for stuff like speech recognition to build concrete interfaces with specific predefined vocabularies and a local-first UX. But stuff like that undermines a cloud-based service, a constantly changing interface, and the opportunities for general spying and manufacturing "engagement" while people struggle to use the stuff you've made. And of course, producing actual specifications means that you would have to own bugs. Besides eliminating employees, much interest in AI is about completely eliminating responsibility. As a user of ML-based monitoring products and such for years, I can say that "intelligence" usually implies no real specifications, no specifications implies no bugs, and no bugs implies rent-seeking behaviour without the burden of any actual responsibilities.

It's frustrating to see how often even technologists buy the story that "users don't want/need concrete specifications" or that "users aren't smart enough to deal with concrete interfaces". It's a trick.

freeone3000

> concrete interfaces with specific predefined vocabularies and a local-first UX

An app? We don’t even need to put AI in it, turns out you can book flights without one.

photonthug

Tech won't freeze in place exactly where it's at today even if some people want that, and even if in some cases it actually would make sense. And if you advocate for this, I think you risk losing credibility. Especially amongst technologists, it's better to think critically about structural problems with the trends and trajectories. AI is fine, change is fine; the question now is really more like why, what for, and in whose interest. To the extent models work locally, we'll be empowered in the end.

Similarly, software eating the world was actually pretty much fine, but SaaS is/was a bit of a trap. And anyone who thought SaaS was bad should be terrified about the moats and platform lock-in that billion dollar models might mean, the enshittification that inevitably follows market dominance, etc.

Honestly we kinda need a new Stallman for the brave new world, someone who is relentlessly beating the drum on this stuff even if they come across as anticorporate and extreme. An extremist might get traction; a call to preserve things as they are probably cannot, and probably should not.

cyanydeez

I see the AI push as a turnkey WALL-E future.

Terr_

> for general spying and manufacturing "engagement"

"Oh, there's one tiny feature that management is really really interested in, make the AI gently upsell the user on a higher tier of subscription if an opportunity presents itself."

genewitch

With today's models that means it will pitch the upsell every three sentences. They're happy to comply.

emn13

Perhaps the solution(s) need to be less about focusing on output quality, and more about having a solid process for dealing with errors. Think undo, containers, git, CRDTs, or whatever, rather than zero tolerance for errors. That probably also means some kind of review for the irreversible bits of any process, and perhaps even process changes where possible to make common processes more reversible (which sounds like an extreme challenge in some cases).

I can't imagine we're anywhere even close to the kind of perfection required not to need something like this - if it's even possible. Humans use all kinds of review and audit processes precisely because perfection is rarely attainable, and that might be fundamental.
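
For what it's worth, the reversible/irreversible split can be made concrete. A minimal sketch in Python, with entirely hypothetical names (the idea, not any real framework):

  # Sketch: an undo journal for agent actions, plus a review gate for the
  # irreversible ones. All names here are hypothetical.
  undo_log = []

  def do_reversible(action, undo):
      """Run an action, remembering how to undo it."""
      action()
      undo_log.append(undo)

  def rollback():
      """Undo everything, most recent first."""
      while undo_log:
          undo_log.pop()()

  def do_irreversible(action, approved_by_human):
      """Sending the email, charging the card: these need a review step."""
      if not approved_by_human:
          raise PermissionError("irreversible step requires human review")
      action()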

_bin_

The biggest issue I’ve seen is “context window poisoning”, for lack of a better term. If it screws something up, it’s highly prone to repeating that mistake. It then makes a bad fix that propagates two more errors, then says, “Sure! Let me address that,” and proceeds to poorly patch those rather than the underlying issue (say, by restructuring the code).

It is almost impossible to produce a useful result, as far as I’ve seen, unless one eliminates that mistake from the context window.

instakill

I really really wish that LLMs had an "eject" function - as in I could click on any message in a chat, and it would basically start a new clone chat with the current chat's thread history.

There are so many times where I get to a point where the conversation is finally flowing in the way that I want and I would love to "fork" into several directions from that one specific part of the conversation.

Instead I have to rely on a prompt that requests the LLM to compress the entire conversation into a non-prose format that attempts to be as semantically lossless as possible; this sadly never works as intended.
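
For what it's worth, with raw API access the "fork" is just list slicing on the message history. A rough sketch, assuming an OpenAI-style list of role/content dicts:

  # Sketch: "forking" a chat is just slicing the message history.
  def fork_chat(messages, at_index):
      """Return a new, independent history ending at message `at_index`."""
      return [dict(m) for m in messages[:at_index + 1]]

  chat = [
      {"role": "user", "content": "Plan a trip"},
      {"role": "assistant", "content": "Sure, where to?"},
      {"role": "user", "content": "Tokyo"},
  ]
  branch_a = fork_chat(chat, 1)  # explore one direction from this point...
  branch_b = fork_chat(chat, 1)  # ...and another, from the same point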

PeterStuer

"If it screws something up it’s highly prone to repeating that mistake"

Certainly true, but coaching it past the mistake sometimes helps (not always):

- roll back to the point before the mistake.

- add instructions so as to avoid the same path: "Do not try X. We tried X; it does not work, as it leads to Y."

- add resources that could clear up a misunderstanding (API documentation, library code)

- rerun the request (improve/reword with observed details or insights)

I feel like some of the agentic frameworks are already including some of these heuristics, but a helping hand still can work to your benefit. Roughly, the loop looks like the sketch below.
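
Compressed into code, the rollback step looks something like this; `run_agent` is a placeholder for whatever replays a message history through your agent:

  # Sketch of the rollback-and-coach loop described above.
  def retry_with_coaching(messages, bad_turn_index, lesson, run_agent):
      history = list(messages[:bad_turn_index])  # roll back before the mistake
      history.append({"role": "user",            # steer away from the bad path
                      "content": f"Do not try this approach: {lesson}"})
      return run_agent(history)                  # rerun the request

Adding API docs or library code as extra context messages works the same way.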

donmcronald

This is what I find. If it makes a mistake, trying to get it to fix the mistake is futile and you can't "teach" it to avoid that mistake in the future.

bongodongobob

I think this is one of the core issues people have when trying to program with them. If you have a long conversation with a bunch of edits, it will start to get unreliable. I frequently start new chats to get around this and it seems to work well for me.

ModernMech

> Perhaps the solutions(s) needs to be less focusing on output quality, and more on having a solid process for dealing with errors. Think undo, containers, git, CRDTs

LLMs are supposed to save us from the toils of software engineering, but it looks like we're going to reinvent software engineering to make AI useful.

Problem: Programming languages are too hard.

Solution: AI!

Problem: AI is not reliable, it's hard to specify problems precisely so that it understands what I mean unambiguously.

Solution: Programming languages!

Workaccount2

With pretty much every new technology, society has bent towards the tech too.

When smartphones first popped up, browsing the web on them was a pain. Now pretty much the whole web has phone versions that make it easier*.

*I recognize the folly of stating this on HN.

techpineapple

But, assuming this is a general thing and not just focused on, say, software development, can you make the tooling around creating this easier than defining the process itself? Loosely speaking, everyone sees the value in test-driven development, but with complex processes I often think writing the test is harder than writing the process.

RicoElectrico

I want to make a simple solution where data is parsed by a vision model, and "engineer for the unhappy path" is my assumption from the get-go. Changing the prompt or swapping the model is cheap. Something like the sketch below.
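
Roughly, with placeholder names (`call_model` and the model ids are assumptions, not a real API): treat the model output as untrusted, validate it against a schema, retry, swap models, and finally fall back to a human.

  # Sketch: validate-then-retry-then-swap for vision model parsing.
  import json

  REQUIRED_KEYS = {"invoice_number", "total", "date"}

  def parse_document(image, call_model, models=("cheap-vision", "big-vision")):
      for model in models:                 # swapping the model is cheap
          for _ in range(2):               # retry transient garbage
              raw = call_model(model, image)
              try:
                  data = json.loads(raw)
              except json.JSONDecodeError:
                  continue                 # unhappy path: not even JSON
              if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
                  return data              # happy path
      return None                          # last unhappy path: human review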

herval

Vision models are also faulty, and sometimes all paths are unhappy paths, so there's really no viable solution. Most of the time, swapping the model completely randomizes the problem space (unless you measure every single corner case, it's impossible to tell if everything got better or if some things got worse...).


yujzgzc

I'm old enough to remember having to talk to a (human) agent in order to book flights, and can confirm that in my experience, the modern flight booking website is an order of magnitude better UX than talking to someone about your travel plans.

kccqzy

That still exists. The last time I did onsite interviews, every single company that wanted to fly me to their office to interview me asked me to talk to a human agent to book flights. But of course the human agent is just a travel agent with no budgetary power; so I ended up calling the agent to inquire about a booking, then calling the recruiter to confirm that the price was acceptable, and then calling the agent back to confirm the booking.

It doesn't have to be this way. Even before the pandemic I remember some companies simply gave me access to an internal app to choose flights, where the only flights shown were those of the right date, right airport, and right price.

toasterlovin

I think what we'll come to widely realize is that syncing state between two minds (in your example, the travel agent's mind and your mind; more widely, AI agents and their users' minds) is extremely expensive and slow, and it's going to be very hard to make these systems good enough to overcome the super low latency of keeping a task contained to a single mind, your own, and just doing most stuff yourself. The CPU/GPU dichotomy as a lens for viewing the world is widely applicable, IME.

leoedin

Yeah, I much prefer using a well designed self service system than trying to explain it over the phone.

The only problem with most of the flights I book now is that they're with low cost airlines and packed with dark patterns designed to push upgrades.

Would an AI salesman be any better though? At least the website can't actively try to persuade me to upgrade.

WesolyKubeczek

An AI agent will likely be worse in that you would have to actively haggle with it so it doesn’t upsell you by default, which IMO is harder than circumnavigating the dark patterns.

An actually useful agent is something that is totally doable with technologies even from a decade ago, which you by necessity need to host yourself, with a sizeable amount of DIY and duct tape, since it won't be allowed to exist as a hosted product. The purveyor of goods and services cannot bargain it into putting useless junk into your shopping cart on impulse. You cannot really upsell it, all the ad impressions are lost on it, and you cannot phish it with ad buttons that look like the UI of your site; it goes in with the sole purpose of making your bookings/arrangements, a quick in-and-out. It is, by its very definition and design, adversarial to how most companies with an Internet presence run things.

serjester

Even in Operator's original demo, the first thing they showed was booking restaurant reservations and ordering groceries. I understand their need to demo something intuitive, but it's still debatable whether these are tasks that most people want delegated to black-box agents.

ToucanLoucan

They don't. I have never once in my life wanted to talk to my smart speaker about what I wanted for dinner, not even because a smart speaker is/can be creepy, not because of social anxiety, no, it's just simpler and more straightforward to open Doordash on my damn phone, and look at a list of restaurants nearby to order from. Or browse a list of products on Amazon to buy. Or just call a restaurant to get a reservation. These tasks are trivial.

And like, as a socially anxious millennial, no I don't particularly like phone calls. However I also recognize that setting my discomfort aside, a direct connection to a human being who can help reason out a problem I'm having is not something easily replaced with a chatbot or an AI assistant. It just isn't. Perfect example: called a place to make a reservation for myself, my wife and girlfriend (poly long story) and found the place didn't usually do reservations on the day in question, but the person did ask when we'd be there. As I was talking to a person, I could provide that information immediately, and say "if you don't take reservations don't worry, that's fine," but it was an off-busy hour so we got one anyway. How does an AI navigate that conversation more efficiently than me?

As a techie person I basically spend the entire day interacting with various software to perform various tasks, work related and otherwise. I cannot overstate: NONE of these interactions, not a single one, is improved one iota by turning it into a conversation, verbal or text-based, with my or someone else's computer. By definition it makes basic tasks take longer, every time, without fail.

bluGill

I've more than once been on a road trip and realized that I wanted something to help me find a meal wherever I'll be sometime in the next 2 hours. I have no idea what the options are and I can't find them. All too often I've settled for some generic fast food when I really wanted something local, but I couldn't get maps to tell me, and such places are one street away where I wouldn't see them. (Remember too, if I'm driving I can't spend time scrolling through a list; but even when I'm the navigator, the interface I can find in maps isn't good.)

Terr_

Agreed, verbally asking for X might make it easier for Aunt "where's the Any key" Tillie to get a solution, but it doesn't necessarily give a better solution for everyone else.

Or, for that matter, solutions you can trust. Remember the pitch for Amazon Dash buttons, where you press it and it maybe-reorders a product for delivery, instantly and sight-unseen? What if the price changed? What if it's not exactly the same product anymore? Wait, did someone else already press it? Maybe I can get a better deal? etc.

Actually, that spurs a random thought: Perhaps some of these smart-speaker ordering pitches land differently if someone is in a socioeconomic class where they're already accustomed to such tasks being done competently by human office-assistants, nannies, etc. Their default expectation might be higher, and they won't need to invest time pinching pennies like the rest of us.

genewitch

Not to detract from your overall message, but are there studies saying that millennials have more social anxiety? My wife is 9 months younger than me, and a millennial, whereas I am Gen X. I have no social anxiety at all; she and our kids do. Like, for her, calling people on the phone requires a sit-down and breathing exercises; I'm always the one to "run in to the store", and she avoids non-concert venues that may be crowded.

My parents were way older than boomers, and hers were boomers, so maybe that's it?

Spooky23

It's no different than the old Amazon button thing. I'm not going to automatically pay whatever price Amazon is going to charge to push-button replenish household goods. Especially in those days, where "The World's Biggest" fence would have pretty wild swings in price.

If I were rich enough to have some bot fly me somewhere, I'd have a real-life minion do it for me.

3p495w3op495

Any customer service or tech support rep can tell you that even humans can't always understand what other humans are attempting to say.

hansmayer

It's so funny when people try to build robots imitating people. I mean part funny, part tragedy of the upcoming bust. The irony being, we would have been better off with an interoperable flight booking API standard which a deterministic headless agent could use to make perfect bookings every single time. There is a reason current user interfaces stem from a scientific discipline once called "Human-Computer Interaction".

TeMPOraL

It's a business problem, not a tech problem. We don't have a solution you described because half of the air travel industry relies on things not being interoperable. AI is the solution at the limit, one set of companies selling users the ability to show a middle finger to a much wider set of companies - interoperability by literally having a digital human approximation pretending to be the user.

the_snooze

I've been a sentient human for at least the last 15 years of tech advancement. Assuming this stuff actually works, it's only a matter of time before these AI services claw back all that value for themselves and hold users and businesses hostage to one another, just like social media and e-commerce before. https://en.wikipedia.org/wiki/Enshittification

Unless these tools can be run locally independent of a service provider, we're just trading one boss for another.

bluGill

The airlines rely on things not interoperating for you. Their agents, however, interoperate all the time via code sharing. They don't want normal people to do this, but if something goes wrong with the airplane you were supposed to be on, they would prefer you get there than not.

jatins

But that's the promise of AI, right? You can't put an API on everything for human + technological reasons.

dartos

You can’t put an API on everything because it’d take a ton of time and money to pull that off.

I can’t think of any technological reasons why every digital system can’t have an API (barring security concerns, as those would need to be case by case)

So instead, we put 100s of billions of dollars into statistical models hoping they could do it for us.

It’s kind of backwards.

hansmayer

It is a promise alright :)

doug_durham

Your use of the word "perfect" is doing a lot of heavy lifting. "Perfect" is a word embedded in a high dimensional space whose local maxima are different for every human on the planet.

hansmayer

No, it's just the intuitively perfect that comes to mind in this context, i.e. reliable and guaranteed to produce a safe outcome, much like Amazon's checkout process. I am fine giving my credit card details to near-perfect automatons like that. I will never give it to a statistical model, which may or may not hallucinate the sum it is supposed to enter into an interface built for humans, not computers.

davesque

Yep, and AI agents essentially throw up a boundary blocking the user from understanding the capabilities of the system they're using. They're like the touch screens in cars that no one asked for, but for software.

wiradikusuma

Booking a flight is actually a task I cannot outsource to a human assistant, let alone an AI. Maybe it's a third-world problem or just me being cheap, but there are heuristics involved when booking flights for a family trip or even just for myself.

Check the official website, compare pricing with aggregators, check other dates, check people's availability on cheap dates. Sometimes I only do the first step, if the official price is reasonable (I travel 1-2x a month, so I have an expectation of how much it should cost).

Don't get me started if I also consider which credit card to use for the points rewards.

kccqzy

Completely agree! Flights are still a large expense for most people, and those in the credit card points game especially like to go to great lengths to score the cheapest possible flights.

For example, this person[0] could have simply booked a United flight from the United site for 15k points. Instead the person batch emailed Turkish Airlines booking offices, found the Thai office that was willing to make that booking but required bank transfers in Thai baht to pay taxes, made two more phone calls to Turkish Airlines to pay taxes with a credit card, and in the end only spent 7.5k points for the same trip on United.

This may be an extreme example, but it shows the amount of familiarity with the points system, the customer service phone tree and the actual rules to get cheap flights.

If AI can do all of that, it'd be useful. Otherwise I'll stick to manual booking.

[0]: https://frequentmiler.com/yes-you-can-still-book-united-flig...

Jianghong94

Now THAT's the workflow I'd like to see an AI agent automate, streamline, and democratize for everybody.

maxbond

If it were available to everybody, it would disappear. This is a market inefficiency that a "trader" with deep knowledge of the structure of this market was able to exploit. But if everyone started doing this, United/Turkish Airlines would see they were losing money and eliminate it. Similar to how airlines have tried to stop people exploiting "hidden cities."

kristjansson

and watch it immediately evaporate, or require even more esoteric knowledge of opaque systems?

Persistent mispricings can only exist if the cost of exploitation removes the benefit or constrains the population.

baxtr

There is a really interesting book called Alchemy by Rory Sutherland.

In one chapter he describes his frustration with GPS based navigation apps. I thought it was similar to what you describe.

> If I am commuting home, I may prefer a slower route that avoids traffic jams. (Humans, unlike GPS devices, would rather keep moving slowly than get stuck in stop-start traffic.) GPS devices also have no notion of trade-offs, in particular relating to optimising ‘average, expected journey time’ and minimising ‘variance’ – the difference between the best and the worst journey time for a given route.

For instance, whenever I drive to the airport, I often ignore my GPS. This is because what I need when I’m catching a flight is not the fastest average journey, but the one with the lowest variance in journey time – the one with the ‘least-bad worst-case scenario’. The satnav always recommends that I travel there by motorway, whereas I mostly use the back roads.
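
A toy illustration of that mean-vs-worst-case distinction, with made-up numbers (not from the book):

  # Toy numbers: mean vs. worst case for two routes (minutes).
  import statistics

  motorway   = [30, 32, 35, 90]   # fast on average, occasionally jammed
  back_roads = [45, 48, 50, 55]   # slower on average, but predictable

  for name, times in [("motorway", motorway), ("back roads", back_roads)]:
      print(f"{name}: mean={statistics.mean(times):.0f} worst={max(times)}")

  # Commuting: pick the lower mean (motorway).
  # Catching a flight: pick the lower worst case (back roads).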

zippergz

I have HAD a human assistant who booked flights for me. But it took them a long time to learn the nuances of my preferences enough to do it without a lot of back and forth. And even then, they still sometimes had to ask. Things like what time of day I prefer to fly based on what I had going on the day before or what I'll be doing after I land. What airlines I prefer based on which lounges I'd have access to, or what aircraft they fly. When I would opt for a connecting flight to get a better price vs. when I want nonstop regardless of cost. And on and on. Probably dozens of factors that might come into play in various combinations depending on where I'm going and why. And preferences that are hard to articulate, but make sense once understood.

With a really excellent human assistant who deeply understood my brain (at least the travel related parts of it), it was kind of nice. But even then there were times when I thought it would be easier and better to just do it myself. Maybe it's a failure of imagination, but I find it very hard to see the path from today's technology to an AI agent that I would trust enough to hand it off, and that would save enough time and hassle that I wouldn't prefer to just do it myself.

sneak

Off topic, but I’m curious: how did you go about finding an assistant that good?

victorbjorklund

I don't really need an AI agent to book flights for me (I just don't travel enough for it to be any burden) but aren't those arguments for an AI agent? If you just wanna book the next flight London to New York it isn't that hard. A few minutes of clicking.

But if you wanna find the cheapest way to get to A, compare different retailers, check multiple people's availability, calculate the effects of credit cards, etc., it takes time. Aren't those things that could be automated with an agent that can find the cheapest flights, propose dates, check availability with multiple people via a messaging app, calculate which credit card to use, etc.?

bgirard

In theory, yes. But in a real-world evaluation, would it pick better flights? I'd like to see evidence that it's able to find a better flight that maximizes this. The tricky part is also: how do you communicate how much I personally weigh a shorter flight vs. points on my preferred carrier vs. having to leave for the airport at 5am vs. 8am? I'm sure my answers would differ from wiradikusuma's answers.

UncleMeat

Yep this is my vibe.

When I'm picking out a flight I'm looking at, among other things:

* Is the itinerary aggravatingly early or late

* Is the layover aggravatingly short or long

* Is the layover in an airport that sucks

* Is the flight on a carrier that sucks

* What does it cost

If you asked me to encode ahead of time the relative value of each of these dimensions I'd never be able to do it. Heck, the relative value to me isn't even constant over time. But show me five options and I can easily select between them. A clear case where search is more convenient than some agent doing it for me.

Jianghong94

Yep, that's what I've been thinking. This shouldn't be that hard; at this point LLMs should already have all the 'rules' (e.g. credit card A buying flight X gives you m points, which can be converted into n miles) in their params, or can easily query the web to get them. Devs need to encode the whole thing into a decision mechanism, and once it's executed, ask the LLM to chase down the specific path (e.g. bombard ticket offices with emails).

antihipocrat

And what happens in the 1% of cases where this fails? At the moment the responsibility is on the person. If I incorrectly book my flight for date X, and I receive the itinerary and realise I chose the wrong month, then damn, I made a mistake and will have to rectify it.

An LLM could organise flights with a lower error rate; however, when it goes wrong, what is the recourse? I imagine it's anger and a self-promise never to use AI for this again.

If you're saying that the AI just supplies suggestions, then maybe it's useful. Though wouldn't people still be double-checking everything anyway? Not sure how much effort this actually saves.

pton_xd

> Booking a flight is actually task I cannot outsource to a human assistant, let alone AI.

Because there is no "correct" flight. Your preference changes as you discover information about what's available at a given time and price.

The helpful AI assistant would present you with options, you'd choose what you prefer, it would refine the options, and so on, until you make your final selection. There would be no communication lag as there would be with a human assistant. That sounds very doable to me.

joseda-hg

The flight-price-to-tolerable-layover-time ratio is something too personal for me to convey to an assistant.

csomar

It is not that you can't outsource it, but there are so many variables that once you've finished explaining them to the assistant (human or AI), you'd be better off doing it yourself. A human assistant only makes sense if he/she is your everything-assistant and has knowledge about your work schedule, life, kids, financials, etc...

amogul

I feel the same way, or at least I wouldn't delegate this unless they fine-tune accuracy and reliability in their apps. Right now, it sits around 40-60%.

extr

The problem I find in many cases is that people are restrained by their imagination of what's possible, so they target existing workflows for AI. But existing workflows exist for a reason: someone already wanted to do that, and there have been countless man-hours put into the optimization of the UX/UI. And by definition they were possible before AI, so using AI for them is a bit of a solution in search of a problem.

Flights are a good example but I often cite Uber as a good one too. Nobody wants to tell their assistant to book them an Uber - the UX/UI is so streamlined and easy, it's almost always easy enough to just do it yourself (or if you are too important for that, you probably have a private driver already). Basically anything you can do with an iPhone and the top 20 apps is in this category. You are literally competing against hundreds of engineers/product designers who had no other goal than to build the best possible experience for accomplishing X. Even if LLMs would have been helpful a priori - they aren't after every edge case has already been enumerated and planned for.

lolinder

> You are literally competing against hundreds of engineers/product designers who had no other goal than to build the best possible experience for accomplishing X.

I think part of what's been happening here is that the hubris of the AI startups is really showing through.

People working on these startups are by definition much more likely than average to have bought the AI hype. And what's the AI hype? That AI will replace humans at somewhere between "a lot" and "all" tasks.

Given that we're filtering for people who believe that, it's unsurprising that they consciously or unconsciously devalue all the human effort that went into the designs of the apps they're looking to replace and think that an LLM could do better.

arionhardison

> I think part of what's been happening here is that the hubris of the AI startups is really showing through.

I think it is somewhat reductive to assign this "hubris" to "AI startups". I would posit that this hubris is more akin to the superiority we feel as human beings.

I have heard people say several times that they "treat AI like a Jr. employee". I think that within the context of a project, AI should be treated according to its level of contribution. If the AI is the expert, I am not going to approach it as if I am an SME who knows exactly what to ask. I am going to focus on the thing I know best, and ask questions around that to discover and learn the best approach. Obviously there is nuance here that is outside the scope of this discussion, but these two fundamentally different approaches have yielded materially different outcomes in my experience.

hexasquid

Treat AI like a junior employee?

Absolutely not. When giving tasks to an AI, we supply them with context, examples of what to do, examples of what not to do, and we clarify their role and job. We stick with them as they work and direct them accordingly when something goes wrong.

I've no idea what would happen if we treated a junior developer like that.

arionhardison

> The problem I find in many cases is that people are restrained by their imagination of what's possible, so they target existing workflows for AI.

I concur, and would like to add that they are also restrained by the limitations of existing "systems" and our implicit and explicit expectations of said systems. I am currently attempting to mitigate the harm done by this restriction by focusing on, and starting with, a first-principles analysis of the problem being solved before starting the work. For example, let's take a well-established and well-documented system like the SSA.

When attempting to develop, refactor, extend, etc. such a system, what is the proper thought process? As I see it, there are two paths:

Path 1:

  a) Breakdown the existing workflows

  b) Identify key performance indicators (KPIs) that align with your business goals

  c) Collect and analyze data related to those KPIs using BPM tools

  d) Find the most expensive worst performing workflows

  e) Automate them E2E w/ interface contracts on either side
This approach locks you into the existing restrictions of the system, workflows, implementation, etc.

Path 2:

  a) Analyze the system to understand its goal in terms of first principles, e.g.: What is the mission of the SSA? To move money based on conditional logic.

  b) What systems/data structures are closest to this function, and does the legacy system reflect this at its core? E.g.: the SSA should just be a ledger, IMO.

  c) If yes, go to "Path 1"; if no, go to "d"

  d) Identify the core function of the system, the critical path (core workflow) and all required parties

  e) Make MVP which only does the bare min
By following path 2 and starting off with an AI analysis of the actual problem, not the problem as it exists as a solution within the context of an existing system, it is my opinion that the previous restrictions can be avoided.

Note: Obviously this is a gross oversimplification of the project management process and there are usually external factors that weigh in and decide which path is possible for a given initiative, my goal here was just to highlight a specific deviation from my normal process that has yielded benefits so far in my own personal experience.

peterjliu

We (ex-Google DeepMind researchers) have been doing research on increasing the reliability of agents and realized it is pretty non-trivial, but there are a lot of techniques to improve it. The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks. We made our own benchmarks to measure progress.

Plug: We just posted a demo of our agent doing sophisticated reasoning over a huge dataset (JFK assassination files: 80,000 PDF pages): https://x.com/peterjliu/status/1906711224261464320

Even on small amounts of files, I think there's quite a palpable difference in reliability/accuracy vs the big AI players.

ai-christianson

> The most important thing is doing rigorous evals that are representative of what your users do in your product. Often this is not the same as academic benchmarks.

OMFG thank you for saying this. As a core contributor to RA.Aid, I think optimizing it for SWE-bench would actively go against perf on real-world tasks. RA.Aid came about in the first place as a pragmatic programming tool (I created it while making another software startup, Fictie). It works well because it was literally made and tested by making other software, and these days it mostly creates its own code.

Do you have any tips or suggestions on how to do more formalized evals, but on tasks that resemble real world tasks?

peterjliu

I would start by making the examples yourself initially, assuming you have a good sense of what that real-world task is. If you can't articulate what a good task is and what a good output is, it is not ready for outsourcing to crowd workers.

And before going to crowd-workers (maybe you can skip them entirely) try LLMs.

ai-christianson

> I would start by making the examples yourself initially

What I'm doing right now is this:

  1) I have X problem to solve using the coding agent.
  2) I ask the agent to do X
  3) I use my own brain: did the agent do it correctly?
If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so it's able to do that.

The hardest part about automating this is #3 above: each evaluation is one-off, and it would be hard to even formalize the evaluation.

SWE-bench, for example, uses unit tests for this, and the agent is blind to the unit tests, so the agent has to make a red test (which it has never seen) go green.
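
Per the suggestion upthread, step 3 can also be roughed in with an LLM judge before involving crowd workers. A sketch where `ask_llm` is a placeholder for any chat-completion call and the rubric is illustrative:

  # Sketch: automate the "did the agent do it correctly?" check with a judge.
  RUBRIC = (
      "You are grading a coding agent.\n"
      "Task: {task}\n"
      "Agent's diff: {diff}\n"
      "Did the agent solve the task? Answer PASS or FAIL, then one reason."
  )

  def judge(task, diff, ask_llm):
      verdict = ask_llm(RUBRIC.format(task=task, diff=diff))
      return verdict.strip().upper().startswith("PASS")

  def pass_rate(examples, ask_llm):
      """examples: (task, diff) pairs saved from real sessions."""
      return sum(judge(t, d, ask_llm) for t, d in examples) / len(examples)

Spot-check the judge against your own brain until you trust it.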

gcp123

I've spent the last six months building a coding agent at work, and the reliability issues are killing us. Our users don't want 'superhuman' results 10% of the time - they want predictable behavior they can trust.

When we tried the 'full agent' approach (letting it roam freely through our codebase), we ended up with some impressive demos but constant production incidents. We've since pivoted to more constrained workflows with human checkpoints, and while less flashy, user satisfaction has gone way up.

The Cursor wipeout incident is a perfect example. It's not about blaming users who don't understand git - it's about tools that should know better. When I hand my code to another developer, they understand the implied contract of 'don't delete all my shit without asking.' Why should AI get a pass?

Reliable > clever. It's the difference between a senior engineer who delivers consistently and a junior who occasionally writes brilliant code but breaks the build every other week.
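
That implied contract can be enforced mechanically: classify tool calls as destructive and require confirmation before running them. A sketch, with illustrative tool names rather than any real agent framework:

  # Sketch: human checkpoint in front of destructive agent actions.
  DESTRUCTIVE = {"delete_file", "drop_table", "force_push"}

  def execute_tool_call(name, args, run_tool, confirm):
      if name in DESTRUCTIVE:
          if not confirm(f"Agent wants to run {name}({args}). Proceed?"):
              return "skipped: user declined"
      return run_tool(name, args)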

joshdavham

My rule of thumb has thus far been: if I’m gonna allow AI to write any bit of code for me, then I must, at a bare minimum, be able to understand that code.

There’s no way I could do what some of these “vibe coders” are doing where they allow AI to write code for them that they don’t even understand.

AlexandrB

I think there's a lot of code that gets written that's either disposable or effectively "write only" in that no one is expected to maintain it. I have friends who write a lot of this code for tasks like data analysis for retail and "vibe coding" isn't that crazy in such a domain.

Basically, what's worse? "Vibes" code that no one understands or a cascade of 20 spreadsheets that no one understands? At least with the "vibes" code you can stick it in git and have some semblance of sane revision control and change tracking.

pton_xd

> I have friends who write a lot of this code for tasks like data analysis for retail and "vibe coding" isn't that crazy in such a domain.

That sort of makes sense, but then again... if you run some analysis code and it spits out a few plots, how do you know what you're looking at is correct if you have no idea what the code is doing?

kibwen

> how do you know what you're looking at is correct if you have no idea what the code is doing?

Does it reaffirm the biases of the one who signs my paychecks? If so, then the code is correct.

cube00

> I have friends who write a lot of this code for tasks like data analysis for retail and "vibe coding" isn't that crazy in such a domain

Considering the hallucinations we've all seen I don't know how they can be comfortable using AI generated data analysis to drive the future direction of the business.

palmotea

> I think there's a lot of code that gets written that's either disposable or effectively "write only" in that no one is expected to maintain it. I have friends who write a lot of this code for tasks like data analysis for retail and "vibe coding" isn't that crazy in such a domain.

> Basically, what's worse? "Vibes" code that no one understands or a cascade of 20 spreadsheets that no one understands?

Correction: it's a "cascade of 20 spreadsheets" that one person understood/understands.

Write only code still needs to work, and someone at some point needs to understand it well enough to know that it works.

Centigonal

> I have friends who write a lot of this code for tasks like data analysis for retail and "vibe coding" isn't that crazy in such a domain.

I think this is a great use case for AI, but the analyst still needs to understand what the code that is output does. There are a lot of ways to transform data that result in inaccurate or misleading results.

LPisGood

Vibe coders focus on writing tests, and verifying function/correctness. It’s not like they don’t read _any_ of the code. They get the vibes, but ignore the details.

inetknght

> what's worse? "Vibes" code that no one understands or a cascade of 20 spreadsheets that no one understands? At least with the "vibes" code you can stick it in git and have some semblance of sane revision control and change tracking.

You can for spreadsheets too.

liveoneggs

two wrongs don't make a right

SkyPuncher

Sure, but you're a professional software engineer, who I assume gets feedback and performance reviews based on the quality of your code.

There's always been a group of beginners that throws stuff together without fully understanding what it does. In the past, this would have been copy 'n' paste from Stack Overflow. Now, that process is simply more automated.

__jochen__

There is also likely to be increased pressure in an SE job to produce more code. You'll find that if others use AI, it'll be hard to be a hold-out and hit fewer delivery milestones, and quality is hard to measure. People are rewarded for shipping, primarily (unless you're explicitly working on high-reliability/assurance products).

__MatrixMan__

I think there are times where it's ok to treat a function like a black box--cases where anything that makes the test pass will do because the test is in fact an exhaustive evaluation of what that code needs to do.

We just need to be better about making it clear which code is that way and which is not.

kevmo314

That's only true as long as you want to modify said code. If it meets your bar for reliability then you won't need to understand it, much like how we don't really need to read or understand compiled assembly code; we largely trust the compiler.

A lot of these vibe coders just have a much lower bar for reliability than you.

joshdavham

This is an interesting point and it's certainly true with respect to most peoples' attitudes towards dependencies.

For example, while I feel the need to understand the code I wrote using pytorch, I don't generally feel the need to totally grok how pytorch works.

fourside

How do you know if it meets your bar for reliability if you don’t understand the output? I don’t know that the analogy to a compiler is apples to apples. A compiler isn’t producing an answer based on statistically generating something that should look like the right answer.

kevmo314

The premise for vibe coding is that it's generating the entire app or site. If the app does what you want then it's meeting the bar.

twotwotwo

FWIW, work has pushed use of Cursor and I quickly came around to a related conclusion: given a reliability vs. anything tradeoff, you more or less always have to prefer reliability. For example, even ignoring subtle head-scratcher bugs, a faster model's output on average needs more revision before it basically works, and on average you end up spending more energy on that than you save by reducing time to first response. Up-front work that decreases the chance of trouble (detailing how you want something done, explicitly pulling specific libraries into context) also tends to be worth it on net, even if the agent might have gotten there by searching (or you could have gotten it there through follow-up requests).

That's my experience working with a largeish mature codebase (all on non-prod code) where you can't get far if you can't use various internal libraries correctly. With standalone (or small greenfield) projects, where results can lean more on public info from pre-training and there's not as much project specific info to pull in, you might see different outcomes.

Maybe the tech and surrounding practice will change over time, but in my short experience it's mostly been about trying to just get to 'acceptable' for this kind of task.

getnormality

"Less capability, more reliability, please" is what I want to say about everything that's happened in the past 20 years. Of everything that's happened since then, I'm happy to have a few new capabilities: smartphones, driving directions, cloud storage, real-time collaborative editing of documents. I don't need anything else. And now I just want my gadget batteries to last longer, and working parental controls on my kids' devices.

danso

I think the replies [0] to the mentioned reddit thread sum up my (perhaps complacent?) feelings about the current state of automated AI programming:

> Does it terrify anyone else that there is an entire cohort of new engineers who are getting into programming because of AI, but missing these absolute basic bare necessities?

> > Terrify? No, it's reassuring that I might still have a place in the world.

[0] https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdo...

bob1029

The reddit post feels like engagement bait to me.

Why would you ask the community a question like "how to source control" when you've been working with (presumably) a programming genius LLM that could provide the most personally tailored path for baby's first git experience? Even if you don't know that "git" is a thing, you could ask questions as if you were a golden retriever and the model would still inevitably recommend git in the first turn of conversation.

Is it really the case that a person who has the ability to use a compiler, IDE, LLM, web browser, reddit, etc., somehow simultaneously lacks the ability to frame basic-ass questions about the very mission they set out on? If stuff like this is not manufactured, then we should all walk away feeling pretty fantastic about our future job prospects.

namaria

If you start from scratch trying to build an ideal system to program computers, you always converge on the time tested tooling that we have now. Code, compilers, interpreters, versioning, etc.

People think "this is hard, I'll re-invent it in an easier way" and end up with a half-assed version of the tooling we've honed over the decades.

mycall

> People think "this is hard, I'll re-invent it in an easier way" and end up with a half-assed version of the tooling we've honed over the decades.

This is a win in the long run, because occasionally the thing people labor over actually turns out to be a better way.

danso

The account is a throwaway but based on its short posting history and its replies, I don't have reason to believe it's a troll:

https://www.reddit.com/r/cursor/comments/1inoryp/comment/mdr...

> I'm not a dev or engineers at all (just a geek working in Finance)

This fits my experience of teaching very intelligent students how to code; if you're an experienced programmer, you simply cannot fathom the kinds of assumptions beginners will make due to gaps in yet-to-be-foundational knowledge. I remember having to tell students to be mindful when searching Stack Overflow for help, because something as simple as an error from Requests (e.g. while doing web scraping) could lead them down a rabbit hole of "solutions" such as completely uninstalling their Python for a different/older version of Python.

layer8

They were using Cursor, not a general LLM, and were asking their fellow Cursor users how they deal with the risk of Cursor destroying the code base.

dfxm12

> Google Flights already nails this UX perfectly

Often when using an AI agent, I think to myself that a web search gets me what I need more reliably and just as quickly. Maybe AI has to learn to crawl before it learns to walk, but each agent I use leaves me without confidence that it will ever be useful, and I genuinely wonder whether they've ever been tested before being published...

monero-xmr

Assume humans can do anything in a factory. So we create a tool to increase the speed and reliability of the human’s output. We do this so much that eventually the whole factory is automated, and the human is simply observing.

Nowhere in that story above is there a customer or factory worker feeding in open-ended inputs. The factory is precise, it takes inputs and produces outputs. The variability is restricted to variability of inputs and the reliability of the factory kit.

Much business software is analogous to the factory. You have human workers who ultimately operate the business. And software is built to automate those tasks precisely.

AI struggles because engineers are trying to build factories through incantation - if they just say the right series of magic spells, the LLM will produce a factory.

And often it can. It’s just a shitty factory that does simple things, often inefficiently with unforeseen edge cases.

At the moment, skilled factory builders (software engineers) are better at holistically understanding the needs of the business and building precise, maintainable, specific factories.

The factory builders will use AI as a tool to help build better factories. Trying to get the AI to build the whole factory soup-to-nuts won’t work.

bhu8

I have been thinking about the exact same problem for a while and was literally hours away from publishing a blogpost on the subject.

+100 on the footnote:

> agents or workflows?

Workflows. Workflows, all the way.

The agents can start using these workflows once they are actually ready to execute stuff with high precision. And by then we will have figured out how to create effective, accurate, and easily diagnosable workflows, so people will stop complaining about "I want to know what's going on inside the black box".

DebtDeflation

I've been building workflows with "AI" capability inserted where appropriate since 2016. Mostly customer service chatbots.

99.9% of real world enterprise AI use cases today are for workflows not agents.

However, "agents" are being pushed because the industry needs a next big thing to keep the investment funding flowing in.

The problem is that even the best reasoning models available today don't have the actual reasoning and planning capability needed to build truly autonomous agents. They might in a year. Or they might not.
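
The workflow/agent distinction is easy to show in code: a workflow keeps control flow deterministic and auditable, and the model fills in exactly one step. A sketch where `classify_intent` stands in for a single constrained LLM call:

  # Sketch: a customer-service workflow with one LLM step.
  def handle_ticket(text, classify_intent):
      intent = classify_intent(text)  # the only "AI" step
      handlers = {
          "refund":   lambda: "refund started",
          "shipping": lambda: "shipping status looked up",
      }
      # Anything the model can't classify goes to a human, not to an agent
      # improvising its own plan.
      return handlers.get(intent, lambda: "escalated to human")()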

breckenedge

Agreed, I started crafting workflows last week. I'm still unimpressed by how poor the current crop of models is at following instructions.

And are there any guidelines on how to manage workflows for a project or set of projects? I’m just keeping them in plain text and including them in conversations ad hoc.

SkyPuncher

Unfortunately, the chosen example kind of weighs down the point. Cursor has an extremely vocal minority (beginner coders) that isn't really representative of its heavyweight users (professional coders). These beginner users face significant issues that come from being new to programming in general. Cursor gives them amazing capabilities, but it also lets them make the same dumb mistakes that most professional developers have made once or twice in their career.

That being said, back in February I was trying out a bunch of AI personal assistant apps/tools. I found, without fail, that every single one of them was advertising features their LLMs could theoretically accomplish but in practice couldn't. Even worse, many of these "assistants" would proactively suggest they could accomplish something, but when you sent them out to do it, they'd tell you they couldn't.

* "Would you like me to call that restaurant?"...."Sorry, I don't have support for that yet"

* "Would you like me to create a reminder?"....Created the reminder, but never executed it

* "Do you want me to check their website?"...."Sorry, I don't support that yet"

Of all of the promised features, the only thing I ended up using any of them for was a text message interface to an LLM. Now that Siri has native ChatGPT support, it's not necessary.